<h1><center>DS300: PE1 - Data Preprocessing Exercises</center></h1>

## Introduction


In today's exercise, we will use a database that describes customers' purchasing behavior during the last Black Friday.

Work through the questions below and write down your answers.


#### Import all the necessary libraries

In [3]:
import sqlalchemy as db
import numpy as np
import pandas as pd

#### Create an SQLite engine and read the `data` table into a variable named `data`.
#### The database is available at [link](https://github.com/BlueJayADAL/DS300/blob/master/datasets/blackfriday.sqlite); upload it to the same directory as this notebook.

In [4]:
# Place the blackfriday.sqlite database file in the same directory as this notebook.
# The /blob/ URL returns an HTML page, so request the raw file and follow the redirect with -L.
##!curl -L -o blackfriday.sqlite https://github.com/BlueJayADAL/DS300/raw/master/datasets/blackfriday.sqlite



In [5]:
engine = db.create_engine('sqlite:///blackfriday.sqlite')

In [6]:
connection = engine.connect()

In [7]:
metadata = db.MetaData()

In [8]:
# Reflect the table definition from the database; autoload_with alone triggers reflection
data = db.Table('data', metadata, autoload_with=engine)

#### Show all the columns of the table

In [9]:
data.columns.keys()



['User_ID',
 'Product_ID',
 'Gender',
 'Age',
 'Occupation',
 'City_Category',
 'Stay_In_Current_City_Years',
 'Marital_Status',
 'Product_Category_1',
 'Product_Category_2',
 'Product_Category_3',
 'Purchase']

#### Show the schema of the Table

In [10]:
metadata.tables['data']



Table('data', MetaData(), Column('User_ID', INTEGER(), table=<data>), Column('Product_ID', TEXT(), table=<data>), Column('Gender', TEXT(), table=<data>), Column('Age', TEXT(), table=<data>), Column('Occupation', INTEGER(), table=<data>), Column('City_Category', TEXT(), table=<data>), Column('Stay_In_Current_City_Years', TEXT(), table=<data>), Column('Marital_Status', INTEGER(), table=<data>), Column('Product_Category_1', INTEGER(), table=<data>), Column('Product_Category_2', INTEGER(), table=<data>), Column('Product_Category_3', INTEGER(), table=<data>), Column('Purchase', INTEGER(), table=<data>), schema=None)

#### Use a query to read all Table data into an array

In [11]:
query = db.select([data])

In [12]:
ResultProxy = connection.execute(query)

In [13]:
ResultSet = ResultProxy.fetchall()

In [14]:
ResultSet[:3]

[(1000001, 'P00069042', 'F', '0-17', 10, 'A', '2', 0, 3, None, None, 8370),
 (1000001, 'P00248942', 'F', '0-17', 10, 'A', '2', 0, 1, 6, 14, 15200),
 (1000001, 'P00087842', 'F', '0-17', 10, 'A', '2', 0, 12, None, None, 1422)]

#### Convert the array into a DataFrame. Remember to supply the columns names as well.

In [15]:
# Take the column names from the reflected table rather than from the first result row
df = pd.DataFrame(data=ResultSet, columns=data.columns.keys())



In [16]:
df.head()

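|   | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |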
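As a possible shortcut for the steps above, pandas can run the query and build the DataFrame in one call. The sketch below is only an illustration: it uses the standard-library `sqlite3` module and a tiny in-memory table with made-up rows (the names `conn` and `toy_df` are assumptions), not the real `blackfriday.sqlite` file.

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory stand-in for blackfriday.sqlite (toy rows, not the real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (User_ID INTEGER, Gender TEXT, Purchase INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(1000001, "F", 8370), (1000001, "F", 15200), (1000002, "M", 7969)],
)
conn.commit()

# read_sql_query executes the query and returns a DataFrame with the column names filled in.
toy_df = pd.read_sql_query("SELECT * FROM data", conn)
conn.close()

print(toy_df.shape)
print(list(toy_df.columns))
```

With the real file, the same call would presumably take a connection opened on `blackfriday.sqlite` instead of `":memory:"`.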


#### Show the dimension of the DataFrame

In [17]:
df.shape



(537577, 12)

#### Show the total number of missing values in each column.

In [18]:
df.isna().sum()



User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            166986
Product_Category_3            373299
Purchase                           0
dtype: int64
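Raw counts are easier to judge as a share of all rows. The sketch below shows one way to compute that share with `isna().mean()` on a small made-up frame (`toy` and `missing_share` are illustrative names). Applied to the counts above, the same arithmetic gives roughly 31% missing for `Product_Category_2` (166986 / 537577) and roughly 69% for `Product_Category_3` (373299 / 537577).

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real table (values are made up).
toy = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 12],
    "Product_Category_2": [np.nan, 6, np.nan, 14],
    "Product_Category_3": [np.nan, 14, np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share of missing entries.
missing_share = toy.isna().mean()
print(missing_share)
```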

#### Show the total number of occurrences of each category in column `Stay_In_Current_City_Years`.

In [19]:
df['Stay_In_Current_City_Years'].value_counts()

1     189192
2      99459
3      93312
4+     82889
0      72725
Name: Stay_In_Current_City_Years, dtype: int64

#### Drop column `Product_Category_3`, since more than half of its values are missing (373,299 of 537,577 rows, about 69%).

In [20]:
df.drop('Product_Category_3', axis=1, inplace=True)



In [21]:
# Double-check that the column has been dropped.
df.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Purchase'],
      dtype='object')

#### Impute the missing values of column `Product_Category_2` with a constant: the category that occurs most often.

In [22]:
# Import Simple Imputer
from sklearn.impute import SimpleImputer


In [23]:
# Find the mode
cat2mode = df['Product_Category_2'].value_counts().idxmax()

# Use the mode to impute missing data
imp = SimpleImputer(missing_values=np.nan,strategy='constant', fill_value=cat2mode)
cat2 = imp.fit_transform(df['Product_Category_2'].values.reshape(-1,1))
df['Product_Category_2'] = cat2


In [24]:
df.head()

|   | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | 8.0 | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | 8.0 | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | 8.0 | 7969 |
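Finding the mode by hand and passing it as a constant is one way to do this. As a hedged alternative, `SimpleImputer` also offers `strategy='most_frequent'`, which looks up the mode itself. The sketch below uses a made-up six-row column, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column: 8.0 is the most frequent observed value, two entries are missing.
toy = pd.DataFrame({"Product_Category_2": [8.0, np.nan, 6.0, 8.0, np.nan, 14.0]})

# 'most_frequent' learns the mode during fit and uses it to fill the gaps.
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy[["Product_Category_2"]] = imp.fit_transform(toy[["Product_Category_2"]])

print(imp.statistics_)                      # the value learned for each column
print(toy["Product_Category_2"].tolist())
```

The two approaches can differ when several values tie for most frequent: scikit-learn documents that `most_frequent` then returns the smallest of them.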


#### Create the feature matrix `X` from every column except `User_ID`, `Product_ID` and `Purchase`, and the label vector `y` from `Purchase`.

In [25]:
X = df.drop(['User_ID', 'Product_ID', 'Purchase'], axis=1)


X.head()

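|   | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | F | 0-17 | 10 | A | 2 | 0 | 3 | 8.0 |
| 1 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 |
| 2 | F | 0-17 | 10 | A | 2 | 0 | 12 | 8.0 |
| 3 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 |
| 4 | M | 55+ | 16 | C | 4+ | 0 | 8 | 8.0 |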


In [26]:
y = df['Purchase']


y.head()

0     8370
1    15200
2     1422
3     1057
4     7969
Name: Purchase, dtype: int64

#### Use `info()` to find all the categorical features in the `X` matrix

In [27]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537577 entries, 0 to 537576
Data columns (total 8 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Gender                      537577 non-null  object 
 1   Age                         537577 non-null  object 
 2   Occupation                  537577 non-null  int64  
 3   City_Category               537577 non-null  object 
 4   Stay_In_Current_City_Years  537577 non-null  object 
 5   Marital_Status              537577 non-null  int64  
 6   Product_Category_1          537577 non-null  int64  
 7   Product_Category_2          537577 non-null  float64
dtypes: float64(1), int64(3), object(4)
memory usage: 32.8+ MB


#### Use LabelEncoder to transform columns `Age`, `City_Category`, `Stay_In_Current_City_Years`, `Product_Category_1` and `Product_Category_2` into numeric type. 

In [28]:
# Import LabelEncoder from sklearn library
from sklearn.preprocessing import LabelEncoder

# Create the Label Encoder object
LE = LabelEncoder()


In [31]:
# Encode the data into labels using label encoder object
X_LE = X[['Age', 'City_Category', 'Stay_In_Current_City_Years', 'Product_Category_1', 'Product_Category_2']]
X_LE = X_LE.apply(LE.fit_transform)


In [32]:
# Update the original DataFrame X with the label encoded data
X[['Age', 'City_Category', 'Stay_In_Current_City_Years', 'Product_Category_1', 'Product_Category_2']] = X_LE

In [32]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537577 entries, 0 to 537576
Data columns (total 8 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   Gender                      537577 non-null  object
 1   Age                         537577 non-null  int32 
 2   Occupation                  537577 non-null  int64 
 3   City_Category               537577 non-null  int32 
 4   Stay_In_Current_City_Years  537577 non-null  int32 
 5   Marital_Status              537577 non-null  int64 
 6   Product_Category_1          537577 non-null  int64 
 7   Product_Category_2          537577 non-null  int64 
dtypes: int32(3), int64(4), object(1)
memory usage: 26.7+ MB
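Because a single `LabelEncoder` is refit for every column, only the mapping of the last column remains available afterwards. If the mappings need to be recovered later, one possible alternative is `OrdinalEncoder`, which encodes several feature columns at once and keeps one category list per column. The sketch below uses a made-up four-row frame; `toy`, `codes` and `restored` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two categorical columns (values are made up).
toy = pd.DataFrame({
    "Age": ["0-17", "55+", "26-35", "0-17"],
    "City_Category": ["A", "C", "B", "A"],
})

# One encoder handles all columns; categories are sorted per column and numbered from 0.
enc = OrdinalEncoder(dtype=int)
codes = enc.fit_transform(toy)

# The learned categories make the encoding reversible.
restored = enc.inverse_transform(codes)

print(enc.categories_)
print(codes)
```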


#### Use `OneHotEncoder()` to transform the columns `Gender` and `Marital_Status`

In [33]:
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

In [34]:
# Create OneHotEncoder Object

onehot = OneHotEncoder(drop='first')


In [35]:
# Convert the columns with the OneHotEncoder object

X_onehot = X[['Gender', 'Marital_Status']]
X_onehot = onehot.fit_transform(X_onehot.values)


In [36]:
# Re-construct the result to a DataFrame

X_onehot = pd.DataFrame(data=X_onehot.toarray(), 
                        columns=onehot.get_feature_names_out())


X_onehot.head()

|   | x0_M | x1_1 |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 |
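For comparison, pandas can produce an equivalent encoding without leaving the DataFrame. The sketch below uses `pd.get_dummies` with `drop_first=True` on a made-up three-row frame; it is an alternative illustration, not the method this exercise asks for, and the resulting column names (`Gender_M`, `Marital_Status_1`) differ from the `x0_M`, `x1_1` names above.

```python
import pandas as pd

# Toy frame: two columns to encode plus one column that should pass through untouched.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Marital_Status": [0, 1, 0],
    "Occupation": [10, 16, 10],
})

# drop_first=True drops the first level of each encoded column, like OneHotEncoder(drop='first').
encoded = pd.get_dummies(
    toy, columns=["Gender", "Marital_Status"], drop_first=True, dtype=float
)

print(encoded)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less convenient when new data must be encoded the same way later.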


In [43]:
# Update the original DataFrame X with the newly one-hot encoded data.
# Drop the raw columns first, then join column-wise (axis=1); both frames share the default RangeIndex.
X = pd.concat([X.drop(['Gender', 'Marital_Status'], axis=1), X_onehot], axis=1)


In [38]:
X.head()

#### Split the X, y vectors into training and testing dataset. Use 20% as the split ratio, and use random seed as 101. 

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)



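To see what the split returns, the sketch below runs the same call on a made-up ten-row array (`X_toy`, `y_toy` and the `_tr`/`_te` names are illustrative). With `test_size=0.2`, two of the ten rows go to the test set, and each feature row stays paired with its label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each; the first feature of row i is 2 * y[i].
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 20% of the rows are held out; random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```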

#### Use `StandardScaler` to standardize the features in X_train and X_test.

In [77]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [None]:
X_train = sc.fit_transform(X_train.values)


In [78]:
X_test = sc.transform(X_test.values)

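The scaler is fit on the training data only and then reused on the test data, so the test set is scaled with the training mean and standard deviation. The sketch below illustrates this on made-up numbers; the array names are assumptions and do not refer to the notebook's `X_train` and `X_test`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy split: three training rows and one test row (values are made up).
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[4.0, 40.0]])

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)   # learn mean and std from the training rows, then scale them
X_te_s = sc.transform(X_te)       # reuse the training statistics; nothing is refit on test data

print(sc.mean_)                   # per-feature training means
print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print(X_te_s)
```

The scaled training columns have mean 0 and standard deviation 1, while the scaled test row generally does not, because it is measured against the training statistics.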

#### Use any regression technique to fit a model on X_train and y_train. Then show the $R^2$ score of its predictions on the test set.

In [30]:
from 





In [31]:
pred = 


In [32]:
from sklearn import metrics

In [33]:
# R^2 score
# r2_score lives in the metrics module imported above
print('R2:', metrics.r2_score(y_test, pred))



0.5977947069374374
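The choice of regressor is left open in this exercise, and the model behind the score above is not shown. As one possible pattern, the sketch below fits a plain `LinearRegression` on synthetic data and scores it with `r2_score`. Everything in it (the data, the coefficients, the `_syn` names) is an assumption for illustration, so its score says nothing about the Black Friday data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal in three features plus a little Gaussian noise.
rng = np.random.default_rng(101)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=101
)

# Fit on the training split, predict on the held-out split, then score the predictions.
model = LinearRegression().fit(X_tr, y_tr)
pred_syn = model.predict(X_te)
score = r2_score(y_te, pred_syn)

print("R2:", score)
```

Any other scikit-learn regressor follows the same `fit` / `predict` pattern, so swapping the model should only change the line that constructs it.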

# Great Job!