In this assignment, you are going to apply what you learned about machine learning to a dataset of your choice on Kaggle (Links to an external site.). Kaggle is an online platform where data scientists can find datasets and enter competitions to predict certain outcomes. In these competitions, Kaggle users can download data and create their model.

Objective

Predict the outcomes in a data set using either Random Forest, Decision Tree or k-NN. Write a Jupyter Notebook report documenting your investigation.

Dataset

You can choose from the following data sets:

- Speed dating experiment (Links to an external site.): predict the variable dec_o (decision by partner)
- Gender recognition of voice (Links to an external site.): predict the variable label (male or female)
- FIFA 18 (Links to an external site.): predict the first item from the variable Preferred Positions (remember the method .split() for strings?)
- Secondary school students (Links to an external site.): predict the variable romantic (has a romantic interest). Use the file student-por.csv, not the other one.
- Employee attrition (Links to an external site.): predict the variable attrition.
Tips

- Cut down the data set down to size. Though not strictly necessary, this is strongly recommended to make it easier. Select 7 variables with strong predictive value, based on your knowledge of the topic (domain knowledge) and/or correlation. Remember to subset the data with df[[‘column 1’, ‘column2’, ‘column3’]]. Don't spend too much time on this step. It's supposed to make the assignment easier, not harder.
- If you find the dataset is too big and calculations take too long, take a random sample with the Pandas method .sample() and run the analysis with the entire data at the end.

Included in your Jupyter Notebook

Please add sufficient comments: not just explaining what you are doing, but why you are doing it.

- Which dataset and variables you selected and why
- Your pre-processing steps (e.g., transformations of variables)
- The head()of the resulting data frame

Classification

- Choose one of the following: k-nearest neighbor, decision tree or random forest
- Explain briefly in your own words how the algorithm works
- Split the data set into a training and test set
- Train the model
- Evaluate the predictive performance of your model on the test set

Please provide a link to your Notebook on GitHub. Make sure the GitHub folder includes the data file so the Notebook runs without problems.

Notes

- Only comments on the code should be in coding formatting. Answers to questions in the assignment (e.g., "Explain how linear regression works in your own words" or "evaluate the performance of your model") are in text (Markdown) cells.
- Use Markdown formula notation for mathematical formulas
- The Jupyter Notebook should run in its entirety. An assignment that doesn't run will not be scored "complete". If you can't get a certain section to run, please comment out the code and explain what you would want it to do.

In [6]:
# Importing  libraries
import seaborn as sns
import sklearn as sk 
import pandas as pd
import matplotlib.pyplot as plt 

In [7]:
# Importing data set
DataFrame = pd.read_csv('Speed Dating Data.csv', encoding= 'ISO-8859-1')
DataFrame = DataFrame[DataFrame['dec_o'].notna()] # deleting all rows containing NaN values in dec_o
DataFrame.reset_index()
DataFrame.head()  # 8378 rows left
#DataFrame.info(verbose = True) # see what type of observation are left
DataFrame=DataFrame.dropna(axis='columns') # dropping all columns (variables) containing NaN values
DataFrame.head() 

Unnamed: 0,iid,gender,idg,condtn,wave,round,position,order,partner,match,samerace,dec_o,dec
0,1,0,1,1,1,10,7,4,1,0,0,0,1
1,1,0,1,1,1,10,7,3,2,0,0,0,1
2,1,0,1,1,1,10,7,10,3,1,1,1,1
3,1,0,1,1,1,10,7,5,4,1,0,1,1
4,1,0,1,1,1,10,7,7,5,1,0,1,1


# Random Forrest

Random forrest is a algorithm that makes use of multiple decision trees in order to get to a classification or regression model.

In [8]:
training_df=DataFrame[0:5500] # splitting the set into a train and test set
testing_df=DataFrame[5501:8377]

y_training=training_df[['dec_o']] # splitting the set into predictors and predicted
x_training=training_df.drop(columns=['dec_o'])

y_testing=testing_df[['dec_o']]
x_testing=testing_df.drop(columns=['dec_o'])

In [9]:
from sklearn.ensemble import RandomForestClassifier # importing the model we are using
from sklearn import metrics

rf = RandomForestClassifier()
rf_model = rf.fit(x_training, y_training['dec_o']); # train the model on training data

predict_test_y = rf_model.predict(x_testing) # test the model
print('Validation Accuracy:', metrics.accuracy_score(y_testing, predict_test_y)) # determining the accuracy of the model

Validation Accuracy: 0.6974965229485396


So in 70% the model is right in its predictions, which is a reasonable value. 