<a href="https://colab.research.google.com/github/alelorys/machineLearningUdemyBootcamp/blob/main/supervised/01_basics/03_feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
This course was created with version `0.22.1`.

### Table of contents:
1. [Importing libraries](#0)
2. [Loading the data](#1)
3. [Creating a copy of the data](#2)
4. [Generating new variables](#3)
5. [Discretizing a continuous variable](#4)
6. [Feature extraction](#5)



### <a name='0'></a> Importing libraries

In [2]:
import numpy as np
import pandas as pd
import sklearn

sklearn.__version__

'0.22.2.post1'

### <a name='1'></a> Loading the data

In [3]:
def fetch_financial_data(company='AMZN'):
    """
    Fetch stock market quotes for the given ticker symbol from Stooq
    via pandas-datareader.
    """
    import pandas_datareader.data as web
    return web.DataReader(name=company, data_source='stooq')

df_raw = fetch_financial_data()
df_raw.head()

Date,Open,High,Low,Close,Volume
2021-09-03,3452.0,3482.67,3436.44,3478.05,2578324
2021-09-02,3494.76,3511.9608,3455.0,3463.12,2925594
2021-09-01,3496.396,3527.0,3475.24,3479.0,3629911
2021-08-31,3424.8,3472.58,3395.59,3470.79,4356413
2021-08-30,3357.425,3445.0,3355.22,3421.57,3192244


### <a name='2'></a> Creating a copy of the data

In [4]:
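# Take an explicit copy of the first five rows so that adding columns
# later neither modifies df_raw nor triggers SettingWithCopyWarning.
df = df_raw.iloc[:5].copy()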
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5 entries, 2021-09-03 to 2021-08-30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Open    5 non-null      float64
 1   High    5 non-null      float64
 2   Low     5 non-null      float64
 3   Close   5 non-null      float64
 4   Volume  5 non-null      int64  
dtypes: float64(4), int64(1)
memory usage: 240.0 bytes


### <a name='3'></a> Generating new variables

In [5]:
df.index.month

Int64Index([9, 9, 9, 8, 8], dtype='int64', name='Date')

In [6]:
df['day'] = df.index.day
df['month'] = df.index.month
df['year'] = df.index.year
df

Date,Open,High,Low,Close,Volume,day,month,year
2021-09-03,3452.0,3482.67,3436.44,3478.05,2578324,3,9,2021
2021-09-02,3494.76,3511.9608,3455.0,3463.12,2925594,2,9,2021
2021-09-01,3496.396,3527.0,3475.24,3479.0,3629911,1,9,2021
2021-08-31,3424.8,3472.58,3395.59,3470.79,4356413,31,8,2021
2021-08-30,3357.425,3445.0,3355.22,3421.57,3192244,30,8,2021
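
The same `DatetimeIndex` exposes further calendar attributes that can serve as features. The sketch below is an illustrative addition, not part of the original notebook: it rebuilds the five rows shown above by hand (so it runs offline) and uses the hypothetical names `quotes`, `day_of_week`, `quarter`, and `is_month_end`.

```python
import pandas as pd

# The five trading days shown in the output above, rebuilt by hand so the
# example runs without a network connection.
dates = pd.to_datetime(
    ['2021-09-03', '2021-09-02', '2021-09-01', '2021-08-31', '2021-08-30']
)
quotes = pd.DataFrame(
    {'Close': [3478.05, 3463.12, 3479.0, 3470.79, 3421.57]},
    index=pd.DatetimeIndex(dates, name='Date'),
)

# Calendar features beyond day/month/year (illustrative choices).
quotes['day_of_week'] = quotes.index.dayofweek  # Monday=0 ... Sunday=6
quotes['quarter'] = quotes.index.quarter
quotes['is_month_end'] = quotes.index.is_month_end.astype(int)
print(quotes)
```

Whether such features help depends on the model and the data; they are cheap to compute, so it is usually easy to try them.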


### <a name='4'></a> Discretizing a continuous variable

In [13]:
df = pd.DataFrame(data={'height': [100., 178.5, 185., 191., 155.5, 183., 168.]})
df

,height
0,100.0
1,178.5
2,185.0
3,191.0
4,155.5
5,183.0
6,168.0


In [14]:
df['height_cat'] = pd.cut(x=df.height, bins=3)
df

,height,height_cat
0,100.0,"(99.909, 130.333]"
1,178.5,"(160.667, 191.0]"
2,185.0,"(160.667, 191.0]"
3,191.0,"(160.667, 191.0]"
4,155.5,"(130.333, 160.667]"
5,183.0,"(160.667, 191.0]"
6,168.0,"(160.667, 191.0]"
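
With an integer `bins`, `pd.cut` splits the range from the minimum to the maximum into equal-width intervals. The lower edge `99.909` in the output above comes from pandas moving the first edge down by 0.1% of the range, so that the minimum (100.0) lands inside the first right-closed interval. A small sketch that checks this by hand (the names `heights`, `edges`, and `manual` are illustrative):

```python
import numpy as np
import pandas as pd

heights = pd.Series([100., 178.5, 185., 191., 155.5, 183., 168.])

# retbins=True also returns the bin edges that pd.cut computed.
categories, edges = pd.cut(heights, bins=3, retbins=True)

# Hand-built equal-width edges over [min, max] for comparison.
manual = np.linspace(heights.min(), heights.max(), 4)
# Lower the first edge by 0.1% of the range, mirroring what pd.cut does
# for right-closed intervals.
manual[0] -= 0.001 * (heights.max() - heights.min())

print(edges)
print(np.allclose(edges, manual))
```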


In [15]:
df['height_cat'] = pd.cut(x=df.height, bins=(90, 149, 165, 195))
df

,height,height_cat
0,100.0,"(90, 149]"
1,178.5,"(165, 195]"
2,185.0,"(165, 195]"
3,191.0,"(165, 195]"
4,155.5,"(149, 165]"
5,183.0,"(165, 195]"
6,168.0,"(165, 195]"


In [16]:
df['height_cat'] = pd.cut(x=df.height, bins=(90, 149, 165, 195), labels=['pony', 'small horse', 'big horse'])
df

,height,height_cat
0,100.0,pony
1,178.5,big horse
2,185.0,big horse
3,191.0,big horse
4,155.5,small horse
5,183.0,big horse
6,168.0,big horse
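
With explicit bin edges it helps to know how values on or outside the edges are treated. By default the intervals are open on the left and closed on the right, and values outside all bins become `NaN`. The sketch below uses hypothetical probe heights (not from the original data) to show this:

```python
import pandas as pd

bins = (90, 149, 165, 195)
labels = ['pony', 'small horse', 'big horse']

# Hypothetical extra heights chosen to probe the bin boundaries.
probe = pd.Series([90., 149., 149.5, 195., 200.])
result = pd.cut(x=probe, bins=bins, labels=labels)

# 90.0 and 200.0 fall outside (90, 195] and come back as NaN;
# 149.0 belongs to 'pony' because each interval is closed on the right.
print(result)
```

Passing `include_lowest=True` to `pd.cut` would make the first interval include its left edge (90 here).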


In [17]:
pd.get_dummies(df, drop_first=True, prefix='height')

,height,height_small horse,height_big horse
0,100.0,0,0
1,178.5,0,1
2,185.0,0,1
3,191.0,0,1
4,155.5,1,0
5,183.0,0,1
6,168.0,0,1
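
The 0/1 output above reflects the pandas version used when the notebook was run. To my knowledge, pandas 2.0 and later return boolean indicator columns by default, so a version-independent sketch can request integers explicitly through the `dtype` parameter (the name `encoded` is illustrative):

```python
import pandas as pd

df = pd.DataFrame(data={'height': [100., 178.5, 185., 191., 155.5, 183., 168.]})
df['height_cat'] = pd.cut(
    x=df.height,
    bins=(90, 149, 165, 195),
    labels=['pony', 'small horse', 'big horse'],
)

# dtype=int requests 0/1 integers regardless of the pandas default.
# drop_first=True drops the 'pony' column, which then acts as the baseline.
encoded = pd.get_dummies(df, drop_first=True, prefix='height', dtype=int)
print(encoded)
```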


### <a name='5'></a> Feature extraction

In [18]:
df = pd.DataFrame(data={'lang': [['PL', 'ENG'], ['GER', 'ENG', 'PL', 'FRA'], ['RUS']]})
df

,lang
0,"[PL, ENG]"
1,"[GER, ENG, PL, FRA]"
2,[RUS]


In [19]:
df['lang_number'] = df['lang'].apply(len)
df

,lang,lang_number
0,"[PL, ENG]",2
1,"[GER, ENG, PL, FRA]",4
2,[RUS],1


In [20]:
df['PL_flag'] = df['lang'].apply(lambda x: 1 if 'PL' in x else 0)
df

,lang,lang_number,PL_flag
0,"[PL, ENG]",2,1
1,"[GER, ENG, PL, FRA]",4,1
2,[RUS],1,0
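
The single `PL_flag` column can be generalized to one indicator column per language. One possible approach, sketched here as an illustrative addition rather than the course's own method, joins each list into a delimited string and then calls `Series.str.get_dummies` (the name `flags` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame(data={'lang': [['PL', 'ENG'], ['GER', 'ENG', 'PL', 'FRA'], ['RUS']]})

# Join each list into a single '|'-separated string, then let
# Series.str.get_dummies build one indicator column per distinct token.
flags = df['lang'].str.join('|').str.get_dummies(sep='|')
print(flags)
```

The `PL` column of `flags` matches the `PL_flag` column built with the lambda above.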


In [21]:
df = pd.DataFrame(data={'website': ['wp.pl', 'onet.pl', 'google.com']})
df

,website
0,wp.pl
1,onet.pl
2,google.com


In [22]:
df.website.str.split('.', expand=True)

,0,1
0,wp,pl
1,onet,pl
2,google,com


In [23]:
new = df.website.str.split('.', expand=True)
df['portal'] = new[0]
df['extension'] = new[1]
df

,website,portal,extension
0,wp.pl,wp,pl
1,onet.pl,onet,pl
2,google.com,google,com
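
Splitting on every dot works here because each address contains exactly one. For an address with more dots, `str.split('.', expand=True)` would produce a third column. A hedged variant, assuming the extension is simply whatever follows the last dot, uses `str.rsplit` with `n=1`; the row `bbc.co.uk` is a hypothetical addition for illustration:

```python
import pandas as pd

# 'bbc.co.uk' is a hypothetical extra row, not part of the original data.
df = pd.DataFrame(data={'website': ['wp.pl', 'onet.pl', 'google.com', 'bbc.co.uk']})

# Split once, starting from the right, so only the last dot separates
# the name from the extension.
parts = df.website.str.rsplit('.', n=1, expand=True)
df['portal'] = parts[0]
df['extension'] = parts[1]
print(df)
```

Note that this treats `uk`, not `co.uk`, as the extension; handling multi-part suffixes properly would need a list of known suffixes.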
