# Assignment 3 - Data Preprocessing

This assignment will be focused on data preprocessing methods that you've seen in class. We will be using the scikit-learn library. Some sample codes will be provided, but the goal is definitely for you to explore different functions and possibilities. Therefore, do not hesitate to go wild and stretch the capabilities of these libraries. 
In addition, we will be using the Random Forest Classifier. Do not worry if you do not know what this classifier is. It will be covered in the following classes!

*Don't forget that commenting your code is very important!*


### 1. Import packages 

###### Import the packages you think will be useful. We have imported some packages for you already, as they are outside the scope of this lecture. 

*Note*: we presented some example packages during class (eg. numpy, matplotlib, etc.). However, we only focus on the end results, so you can use any libraries of you choice. 

In [2]:
# IMPORT PACKAGES
from sklearn.ensemble import RandomForestClassifier

# =========== YOUR CODE HERE ======== 

import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

### 2. Get started with the machine learning. 

For this assignment, we will be using the kickstarter dataset from <a>kaggle.com</a>. Kickstarter is a famous online platform for entreneurs to advertise and raise money for their new startup projects. This dataset compiles the success, goal and other informations of the projects in 2018. 

We will try to use this dataset to predict the whether or not a project would succeed. We will be using the **'state'** column as out target labels. 

###### 1) Read the dataset './dataset/kickstarter.csv'. 

In [20]:
# =========== YOUR CODE HERE ======== 
ds = pd.read_csv('./dataset/kickstarter.csv')
#ds = ds.drop(columns='name')
#ds = ds.drop(columns = 'currency')
#ds = ds.drop(columns = 'ID')
#ds = ds.fillna(ds['usd pledged'].mean())
ds.isna().any()

ID                  False
name                 True
category            False
main_category       False
currency            False
deadline            False
goal                False
launched            False
pledged             False
state               False
backers             False
country             False
usd pledged          True
usd_pledged_real    False
usd_goal_real       False
dtype: bool

###### 2) Describe the dataset. What do you see? 

### <span style="background-color: #F9F2EB"> Your dataset description here: ...  </span>

###### 3) Create your train and test sets using a 0.8/0.2 split with the `train_test_split`  function from sciki-learn. 
Name your final datasets as `X_train`, `Y_train`, `X_test` and `Y_test`.

*Note: There are many ways of splitting data. A sample code for train-test splitting using *masks* is provided.*

In [4]:
# Random mask to split train / test sets. 
msk = np.random.rand(len(ds)) < 0.8 
train = ds[msk]
test = ds[~msk]

Y_train = train['state']
X_train = train.drop(columns='state')


Y_test = test['state']
X_test = test.drop(columns='state')

# Examine the size of the datasets to make sure that the split ratio was properly applied. (0.8/0.2)
print('Number of samples in train set: ', len(X_train))
print('Number of samples in test set: ', len(X_test))


Number of samples in train set:  302930
Number of samples in test set:  75731


In [23]:
# SCIKIT-LEARN TRAIN-TEST SPLIT
# =========== YOUR CODE HERE ======== 
Y = ds['state']
X = ds.drop(columns='state')
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)


 # get dummy = one hot
    

# Examine the size of the datasets to make sure that the split ratio was properly applied. (0.8/0.2)
print('Number of samples in train set: ', len(X_train))
print('Number of samples in test set: ', len(X_test))

Number of samples in train set:  302928
Number of samples in test set:  75733


###### 4) Fit the dataset onto the random forest classifer. 
Once again, do not worry about the what the classifier entails. We will cover this in more details in the following classes. 

In [18]:
#enc = OneHotEncoder(handle_unknown='error')

#X_train = enc.fit(X_train)
randomForest = RandomForestClassifier()

# =========== YOUR CODE HERE ======== 
randomForest.fit(X_train, Y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### UH OH!

As you may see, the dataset is not ready to be fit onto the algorithm just yet. There is some preprocessing to do before you can do anything with it. The machine does not quite understand the data yet. While you're at it, you can also play around to see how you can preprocess the data to enhance the model performance too!

Things that you can consider doing include, but not limited to: 
- Encode categorical data to machine intepretable 
- Normalize your data 
- Select / drop columns
- Take care of missing values 
- etc. 

###### 5) Preprocess your data the way you want, but make it work!

In [None]:
# =========== YOUR CODE HERE ======== 

In [26]:
# Make a copy of the dataset because you never want to overwrite the original one! 
processed_ds = ds.copy() 

* **Missing values** 

In [27]:
# Check which row as missing values.
processed_ds.isna().any()

ID                  False
name                 True
category            False
main_category       False
currency            False
deadline            False
goal                False
launched            False
pledged             False
state               False
backers             False
country             False
usd pledged          True
usd_pledged_real    False
usd_goal_real       False
dtype: bool

In [32]:
# Fill 'usd pledged' missing values with those from 'usd_pledged_real', 
# assuming they are similar from examing the other rows.
'''processed_ds['usd pledged'] = processed_ds.apply(
    lambda row: row['usd_pledged_real'] if np.isnan(row['usd pledged']) else row['usd pledged'],
    axis=1)'''
processed_ds['usd pledged'] = processed_ds.fillna(processed_ds['usd pledged'].mean())

In [33]:
# Fill 'name' missing values with 'no name'
processed_ds['name'] = processed_ds['name'].fillna(value='no name')

In [42]:
# Verify that we have handled all the missing values. 
processed_ds.isna().any()

category            False
main_category       False
currency            False
deadline            False
goal                False
launched            False
pledged             False
state               False
backers             False
country             False
usd pledged         False
usd_pledged_real    False
usd_goal_real       False
dtype: bool

* **Dropping columns**

In [35]:
# Drop ID and name. These columns don't seem to give relevant information to the success of the project.
processed_ds = processed_ds.drop(columns=['ID', 'name'])
processed_ds

Unnamed: 0,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.00,failed,0,GB,1000002330,0.00,1533.95
1,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.00,failed,15,US,1000003930,2421.00,30000.00
2,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.00,failed,3,US,1000004038,220.00,45000.00
3,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.00,failed,1,US,1000007540,1.00,5000.00
4,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.00,canceled,14,US,1000011046,1283.00,19500.00
5,Restaurants,Food,USD,2016-04-01,50000.0,2016-02-26 13:38:27,52375.00,successful,224,US,1000014025,52375.00,50000.00
6,Food,Food,USD,2014-12-21,1000.0,2014-12-01 18:30:44,1205.00,successful,16,US,1000023410,1205.00,1000.00
7,Drinks,Food,USD,2016-03-17,25000.0,2016-02-01 20:05:12,453.00,failed,40,US,1000030581,453.00,25000.00
8,Product Design,Design,USD,2014-05-29,125000.0,2014-04-24 18:14:43,8233.00,canceled,58,US,1000034518,8233.00,125000.00
9,Documentary,Film & Video,USD,2014-08-10,65000.0,2014-07-11 21:55:48,6240.57,canceled,43,US,100004195,6240.57,65000.00


* **Encode categorical data.**
Machines do not understand words (strings). You must encode the data into numbers. 

In [36]:
# Look at the columns datatypes and see which columns need to be taken care of. 
processed_ds.dtypes

category             object
main_category        object
currency             object
deadline             object
goal                float64
launched             object
pledged             float64
state                object
backers               int64
country              object
usd pledged          object
usd_pledged_real    float64
usd_goal_real       float64
dtype: object

In [37]:
# Convert 'category', 'main_category', 'currency', 'country' into numerical categorical codes 
processed_ds['category'] = processed_ds["category"].astype('category').cat.codes
processed_ds['main_category'] = processed_ds["main_category"].astype('category').cat.codes
processed_ds['currency'] = processed_ds["currency"].astype('category').cat.codes
processed_ds['country'] = processed_ds["country"].astype('category').cat.codes

In [38]:
# Examine the dataset to see if the encoding have been applied properly. 
processed_ds.head()

Unnamed: 0,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,108,12,5,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,9,1000002330,0.0,1533.95
1,93,6,13,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,22,1000003930,2421.0,30000.0
2,93,6,13,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,22,1000004038,220.0,45000.0
3,90,10,13,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,22,1000007540,1.0,5000.0
4,55,6,13,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,22,1000011046,1283.0,19500.0


In [39]:
# 'launched' and 'deadline' hold dates. This is not categorical therefore label encoding wouldn't work. 
# We chose to keep the year of the date for simplicity purposes. 
processed_ds['launched'] = pd.to_datetime(processed_ds['launched']).dt.year
processed_ds['deadline'] = pd.to_datetime(processed_ds['deadline']).dt.year

In [40]:
# Verify that all columns now hold numerical values. 
processed_ds.dtypes

category              int16
main_category          int8
currency               int8
deadline              int64
goal                float64
launched              int64
pledged             float64
state                object
backers               int64
country                int8
usd pledged          object
usd_pledged_real    float64
usd_goal_real       float64
dtype: object

###### 6) Refit your processed dataset onto the classifier. Test your model on your test set and print out the accuracy. 
If everything goes well, you should get an accuracy of around 0.85~0.88

In [1]:


# Random mask to split train / test sets. 
msk = np.random.rand(len(processed_ds)) < 0.8 
train = processed_ds[msk]
test = processed_ds[~msk]

Y_train = train['state']
X_train = train.drop(columns='state')

Y_test = test['state']
X_test = test.drop(columns='state')

randomForest.fit(X_train, Y_train)
prediction = randomForest.predict(X_test)
print('Accuracy: ',accuracy_score(prediction, Y_test))

NameError: name 'np' is not defined

###### 7) What other preprocessing techniques can you use to improve performance? Refer for documentation here for other preprocessing techniques [https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing]