<a href="https://colab.research.google.com/github/aashixomen/ML-bootcamp/blob/main/01_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

scikit-learn

Library homepage: https://scikit-learn.org

Documentation/User Guide: https://scikit-learn.org/stable/user_guide.html

The core machine learning library for Python.

To install scikit-learn, use the command below:

In [None]:
!pip install scikit-learn

To upgrade scikit-learn to the latest version, use the command below:

In [None]:
!pip install --upgrade scikit-learn

This course was originally written for version 0.22.1. The outputs shown below were generated with version 1.6.1.

Data preprocessing:

  1. [Importing libraries](#0)
  2. [Generating the data](#1)
  3. [Creating a copy of the data](#2)
  4. [Changing data types and initial exploration](#3)
  5. [LabelEncoder](#4)
  6. [OneHotEncoder](#5)
  7. [Pandas get_dummies()](#6)
  8. [Standardization - StandardScaler](#7)
  9. [Preparing the data for a model](#8)

### <a name='0'></a> Importing libraries

In [5]:
    import numpy as np
    import pandas as pd
    import sklearn

    sklearn.__version__

'1.6.1'

### <a name='1'></a> Generating the data

In [6]:
    data = {
        'size': ['XL', 'L', 'M', 'L', 'M'],
        'color': ['red', 'green', 'blue', 'green', 'red'],
        'gender': ['female', 'male', 'male', 'female', 'female'],
        'price': [199.0, 89.0, 99.0, 129.0, 79.0],
        'weight': [500, 450, 300, 380, 410],
        'bought': ['yes', 'no', 'yes', 'no', 'yes']
    }

    df_raw = pd.DataFrame(data=data)
    df_raw

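| | size | color | gender | price | weight | bought |
|---|---|---|---|---|---|---|
| 0 | XL | red | female | 199.0 | 500 | yes |
| 1 | L | green | male | 89.0 | 450 | no |
| 2 | M | blue | male | 99.0 | 300 | yes |
| 3 | L | green | female | 129.0 | 380 | no |
| 4 | M | red | female | 79.0 | 410 | yes |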


### <a name='2'></a> Creating a copy of the data

In [8]:
    df = df_raw.copy()
    df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   size    5 non-null      object 
 1   color   5 non-null      object 
 2   gender  5 non-null      object 
 3   price   5 non-null      float64
 4   weight  5 non-null      int64  
 5   bought  5 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 372.0+ bytes


### <a name='3'></a> Changing data types and initial exploration

In [9]:
for col in ['size', 'color', 'gender', 'bought']:
    df[col] = df[col].astype('category')

df['weight'] = df['weight'].astype('float')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   size    5 non-null      category
 1   color   5 non-null      category
 2   gender  5 non-null      category
 3   price   5 non-null      float64 
 4   weight  5 non-null      float64 
 5   bought  5 non-null      category
dtypes: category(4), float64(2)
memory usage: 744.0 bytes
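
The conversion to `category` above stores each column as a set of categories plus integer codes. The sketch below, which uses a standalone `sizes` Series rather than the notebook's `df`, shows how to inspect them. The ordered variant (`M < L < XL`) is an assumption added for illustration and is not part of the original notebook.

```python
import pandas as pd

sizes = pd.Series(['XL', 'L', 'M', 'L', 'M'])

# Default conversion: the categories are the sorted unique values.
as_category = sizes.astype('category')
print(list(as_category.cat.categories))   # ['L', 'M', 'XL']
print(as_category.cat.codes.tolist())     # [2, 0, 1, 0, 1]

# Assumed ordering M < L < XL, declared explicitly (illustrative only).
size_dtype = pd.CategoricalDtype(categories=['M', 'L', 'XL'], ordered=True)
as_ordered = sizes.astype(size_dtype)
print(as_ordered.cat.codes.tolist())      # [2, 1, 0, 1, 0]
print(as_ordered.max())                   # XL
```

With the default conversion the codes follow alphabetical order, which does not match the natural size order; declaring the order explicitly is one way to make comparisons such as `max()` meaningful.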


In [11]:
df.describe()

| | price | weight |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 119.0 | 408.0 |
| std | 48.476799 | 75.299402 |
| min | 79.0 | 300.0 |
| 25% | 89.0 | 380.0 |
| 50% | 99.0 | 410.0 |
| 75% | 129.0 | 450.0 |
| max | 199.0 | 500.0 |


In [13]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 5.0 | 119.0 | 48.476799 | 79.0 | 89.0 | 99.0 | 129.0 | 199.0 |
| weight | 5.0 | 408.0 | 75.299402 | 300.0 | 380.0 | 410.0 | 450.0 | 500.0 |


In [14]:
df.describe(include=['category']).T

| | count | unique | top | freq |
|---|---|---|---|---|
| size | 5 | 3 | L | 2 |
| color | 5 | 3 | green | 2 |
| gender | 5 | 2 | female | 3 |
| bought | 5 | 2 | yes | 3 |


### <a name='4'></a> LabelEncoder

In [18]:
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    le.fit(df['bought'])
    le.transform(df['bought'])

array([1, 0, 1, 0, 1])

In [19]:
le.fit_transform(df['bought'])

array([1, 0, 1, 0, 1])

In [23]:
le.classes_

array(['no', 'yes'], dtype=object)
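
The order of `classes_` determines the codes: `LabelEncoder` sorts the unique labels and assigns each label its position in that sorted array. The sketch below illustrates this with a standalone `labels` array (a hypothetical name, not from the notebook) and compares the result with `np.unique(..., return_inverse=True)`, which computes the same mapping for this input.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['yes', 'no', 'yes', 'no', 'yes'])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# np.unique returns the sorted unique values; with return_inverse=True it also
# returns, for every element, its index in that sorted array.
classes, inverse = np.unique(labels, return_inverse=True)

print(le_demo.classes_.tolist())  # ['no', 'yes']
print(codes.tolist())             # [1, 0, 1, 0, 1]
print(inverse.tolist())           # [1, 0, 1, 0, 1]
```

Because the codes come from sorted order, `'no'` maps to 0 and `'yes'` to 1 regardless of which label appears first in the data.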

In [21]:
df['bought'] = le.fit_transform(df['bought'])
df

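| | size | color | gender | price | weight | bought |
|---|---|---|---|---|---|---|
| 0 | XL | red | female | 199.0 | 500.0 | 1 |
| 1 | L | green | male | 89.0 | 450.0 | 0 |
| 2 | M | blue | male | 99.0 | 300.0 | 1 |
| 3 | L | green | female | 129.0 | 380.0 | 0 |
| 4 | M | red | female | 79.0 | 410.0 | 1 |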


In [24]:
le.inverse_transform(df['bought'])

array(['yes', 'no', 'yes', 'no', 'yes'], dtype=object)

In [28]:
# Map the numeric codes back to the original labels
original_labels = le.inverse_transform(df['bought'])
print("Original labels after inverse_transform:", original_labels)

# Restore the original labels in the DataFrame and display it
df['bought'] = original_labels
df

Original labels after inverse_transform: ['yes' 'no' 'yes' 'no' 'yes']


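| | size | color | gender | price | weight | bought |
|---|---|---|---|---|---|---|
| 0 | XL | red | female | 199.0 | 500.0 | yes |
| 1 | L | green | male | 89.0 | 450.0 | no |
| 2 | M | blue | male | 99.0 | 300.0 | yes |
| 3 | L | green | female | 129.0 | 380.0 | no |
| 4 | M | red | female | 79.0 | 410.0 | yes |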


### <a name='5'></a> OneHotEncoder

In [34]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['size']])

In [40]:
# Wrap the encoded array in a DataFrame with readable column names
pd.DataFrame(encoder.transform(df[['size']]), columns=encoder.get_feature_names_out(['size']))

| | size_L | size_M | size_XL |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 |


In [36]:
encoder.categories_

[array(['L', 'M', 'XL'], dtype=object)]

In [41]:
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(df[['size']])
encoder.transform(df[['size']])

array([[0., 1.],
       [0., 0.],
       [1., 0.],
       [0., 0.],
       [1., 0.]])

In [43]:
df

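| | size | color | gender | price | weight | bought |
|---|---|---|---|---|---|---|
| 0 | XL | red | female | 199.0 | 500.0 | yes |
| 1 | L | green | male | 89.0 | 450.0 | no |
| 2 | M | blue | male | 99.0 | 300.0 | yes |
| 3 | L | green | female | 129.0 | 380.0 | no |
| 4 | M | red | female | 79.0 | 410.0 | yes |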


### <a name='6'></a> Pandas get_dummies()

In [44]:
pd.get_dummies(data=df)

| | price | weight | size_L | size_M | size_XL | color_blue | color_green | color_red | gender_female | gender_male | bought_no | bought_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 199.0 | 500.0 | False | False | True | False | False | True | True | False | False | True |
| 1 | 89.0 | 450.0 | True | False | False | False | True | False | False | True | True | False |
| 2 | 99.0 | 300.0 | False | True | False | True | False | False | False | True | False | True |
| 3 | 129.0 | 380.0 | True | False | False | False | True | False | True | False | True | False |
| 4 | 79.0 | 410.0 | False | True | False | False | False | True | True | False | False | True |


In [45]:
pd.get_dummies(data=df, drop_first=True)

| | price | weight | size_M | size_XL | color_green | color_red | gender_male | bought_yes |
|---|---|---|---|---|---|---|---|---|
| 0 | 199.0 | 500.0 | False | True | False | True | False | True |
| 1 | 89.0 | 450.0 | False | False | True | False | True | False |
| 2 | 99.0 | 300.0 | True | False | False | False | True | True |
| 3 | 129.0 | 380.0 | False | False | True | False | False | False |
| 4 | 79.0 | 410.0 | True | False | False | True | False | True |


In [46]:
pd.get_dummies(data=df, drop_first=True, prefix='new')

| | price | weight | new_M | new_XL | new_green | new_red | new_male | new_yes |
|---|---|---|---|---|---|---|---|---|
| 0 | 199.0 | 500.0 | False | True | False | True | False | True |
| 1 | 89.0 | 450.0 | False | False | True | False | True | False |
| 2 | 99.0 | 300.0 | True | False | False | False | True | True |
| 3 | 129.0 | 380.0 | False | False | True | False | False | False |
| 4 | 79.0 | 410.0 | True | False | False | True | False | True |
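
A single string `prefix` is applied to every encoded column, so names such as `new_M` and `new_red` no longer show which source column they came from. The sketch below shows two other `pd.get_dummies` options that may be useful here: `columns` to encode only selected columns (with `dtype=int` for 0/1 output instead of booleans) and a dict `prefix` to set a separate prefix per column. The frame `df_demo` and the prefixes `s` and `c` are hypothetical choices for illustration.

```python
import pandas as pd

# Small standalone frame (illustrative, mirrors three columns of the notebook's data)
df_demo = pd.DataFrame({
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red'],
    'price': [199.0, 89.0, 99.0, 129.0, 79.0],
})

# Encode only 'size' and emit integers (0/1) instead of booleans
only_size = pd.get_dummies(df_demo, columns=['size'], dtype=int)
print(only_size)

# Use a separate prefix for each encoded column
renamed = pd.get_dummies(
    df_demo,
    columns=['size', 'color'],
    prefix={'size': 's', 'color': 'c'},
)
print(list(renamed.columns))
```

Columns that are not encoded keep their position at the front of the result, and the dummy columns are appended after them.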
