<a href="https://colab.research.google.com/github/erezatccsf/ProjectDemoForAI_club.ipynb/blob/master/ProjectDemoForAI_club.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Craig Persiko<br>
Project Demo Presentation for CCSF AI Club, based on:<br>
Data Science Principles and Practices Using Python<br>
UC Berkeley Extension<br>
Our class used the following textbook, which is available for free at:
https://jakevdp.github.io/PythonDataScienceHandbook/ <br>
Final Project, due 4/7/2020<br>
An analysis of how Random Forest and simple Decision Tree predictive models perform on 4 different data sets, with a deeper look at my Assignment 2 solution (IMDB Data Set movie ratings prediction).<p>

This project started with Assignment 2 from the class, which is described here: https://docs.google.com/document/d/1EmFtIgqZRCAa5-q3wRSH7Kk388FRcabNbbshMSbpDEs/edit?usp=sharing <br>
My code for the solution of this assignment follows:

In [0]:
# Craig Persiko
# IMDB dataset rating prediction
# This first code block begins my solution to Assignment 2. 
# See the last code block for the Tree analysis.

import pandas as pd
import numpy as np

# Part 1: Data Set Generation 15%
# Join two other sources (e.g. title.crew.tsv.gz, title.basics.tsv.gz, etc) to make some meaningful features for rating prediction.
# Only consider movies with startYear  after 1980 (including 1980)

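def convertToInt(val):
  # Map any value that cannot be parsed as an integer
  # (e.g. a missing-value marker such as "\N") to 0.
  try:
    return np.int16(val)
  except (ValueError, TypeError, OverflowError):
    return np.int16(0)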

df_ratings = pd.read_csv("https://datasets.imdbws.com/title.ratings.tsv.gz", sep='\t')
df_crew = pd.read_csv("https://datasets.imdbws.com/title.crew.tsv.gz", sep='\t')
df_basics = pd.read_csv("https://datasets.imdbws.com/title.basics.tsv.gz", sep='\t', dtype="str", converters={'startYear':convertToInt})

df_ratingsCrew = pd.merge(df_ratings, df_crew)
df_fullData = pd.merge(df_ratingsCrew, df_basics)

columnsToKeep = ['tconst', 'titleType', 'primaryTitle', 'averageRating', 'startYear', 'genres', 'directors']
df_selectedData = df_fullData.loc[df_fullData.startYear >= 1980 , columnsToKeep]
# dropna() returns a new DataFrame, so the result must be assigned back
df_selectedData = df_selectedData.dropna()

print("Data for all titles after 1980 has shape:", df_selectedData.shape)
print("and here are the first few rows:")
display(df_selectedData)



Data for all titles after 1980 has shape: (864873, 7)
and here are the first few rows:


| | tconst | titleType | primaryTitle | averageRating | startYear | genres | directors |
|---|---|---|---|---|---|---|---|
| 4029 | tt0015724 | movie | Dama de noche | 6.2 | 1993 | Drama,Mystery,Romance | nm0529960 |
| 4475 | tt0016906 | movie | Frivolinas | 5.6 | 2014 | Comedy,Musical | nm0136068 |
| 5028 | tt0018295 | short | El puño de hierro | 6.7 | 2004 | Action,Drama,Short | nm0305771 |
| 17176 | tt0035423 | movie | Kate & Leopold | 6.4 | 2001 | Comedy,Fantasy,Romance | nm0003506 |
| 18140 | tt0036606 | movie | Another Time, Another Place | 6.5 | 1983 | Drama,War | nm0705535 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1030403 | tt9916576 | tvEpisode | Destinee's Story | 6.0 | 2019 | Reality-TV | \N |
| 1030404 | tt9916578 | tvEpisode | The Trial of Joan Collins | 8.4 | 2019 | Adventure,Biography,Comedy | nm0373673 |
| 1030405 | tt9916720 | short | The Nun 2 | 5.6 | 2019 | Comedy,Horror,Mystery | nm10538600 |
| 1030406 | tt9916766 | tvEpisode | Episode #10.15 | 6.8 | 2019 | Family,Reality-TV | \N |
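
The Part 1 code calls `pd.merge` without saying which column to join on or which kind of join to use. As a minimal sketch, assuming two tiny made-up tables (the IDs and values below are illustrative, not real IMDb rows), the following shows what those defaults do: pandas joins on the column names the two frames share and keeps only keys present in both.

```python
import pandas as pd

# Tiny made-up stand-ins for the ratings and crew tables (illustrative values only).
ratings = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                        "averageRating": [6.2, 8.1, 2.4]})
crew = pd.DataFrame({"tconst": ["tt1", "tt2", "tt4"],
                     "directors": ["nm1", "nm2", "nm3"]})

# With no `on=` argument, merge joins on the columns both frames share
# (here only `tconst`) and defaults to an inner join.
merged = pd.merge(ratings, crew)
print(merged)

# Spelling the defaults out gives the same result and documents the intent.
explicit = pd.merge(ratings, crew, on="tconst", how="inner")
print("Same result with explicit arguments:", merged.equals(explicit))
```

Titles that appear in only one table (`tt3` and `tt4` here) are dropped by the inner join, which is one reason the merged data can have fewer rows than either input.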


In [0]:
# Part 2: Modeling 15%
# Train/Dev/Test random split = 80/10/10
# Model the problem as 3-class multiclass classification:
# Very Good: >7
# Good: 3.5 < x < 6.5
# Bad: < 3

# Setting up named constants for our 3 classes:
veryGood = 2
good = 1
bad = 0

df_veryGood = df_selectedData.query('averageRating > 7')
df_veryGood = df_veryGood.eval('averageRating = @veryGood')

df_good = df_selectedData.query('3.5 < averageRating < 6.5')
df_good = df_good.eval('averageRating = @good')

df_bad = df_selectedData.query('averageRating < 3')
df_bad = df_bad.eval('averageRating = @bad')

df_classifiedData = pd.concat([df_veryGood, df_good, df_bad])

print("Data with just 3 classes for ratings has shape:", df_classifiedData.shape)

# Code all features with numeric values
import hashlib

df_classifiedData['hashed_directors'] = df_classifiedData['directors'].apply(lambda x: int(hashlib.sha1(x.encode()).hexdigest(), 16) % (100))
df_classifiedData['hashed_genres'] = df_classifiedData['genres'].apply(lambda x: int(hashlib.sha1(str(x).encode()).hexdigest(), 16) % (100))
df_classifiedData['hashed_type'] = df_classifiedData['titleType'].apply(lambda x: int(hashlib.sha1(str(x).encode()).hexdigest(), 16) % (100))

# Train/Dev/Test random split = 80/10/10
from sklearn.model_selection import train_test_split

df_train, df_dev = train_test_split(df_classifiedData, test_size=0.2)
df_dev, df_test = train_test_split(df_dev, test_size=0.5)

print("Training data with hashed features has shape:", df_train.shape)
print("Dev data with hashed features has shape:", df_dev.shape)
print("Test data with hashed features has shape:", df_test.shape)

# Train the model

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
dtModel = DecisionTreeClassifier()
rfModel = RandomForestClassifier()

featureColumns = ['hashed_type', 'startYear', 'hashed_genres', 'hashed_directors']
X_train = df_train.loc[:, featureColumns]
y_train = df_train.averageRating
X_dev = df_dev.loc[:, featureColumns]
y_dev = df_dev.averageRating
X_test = df_test.loc[:, featureColumns]
y_test = df_test.averageRating

dtModel.fit(X_train, y_train)
print("Decision Tree Training accuracy:", dtModel.score(X_train, y_train)) 
print("Decision Tree Dev accuracy:", dtModel.score(X_dev, y_dev)) 
print("Decision Tree Test accuracy:", dtModel.score(X_test, y_test))

rfModel.fit(X_train, y_train)
print("Random Forest Training accuracy:", rfModel.score(X_train, y_train)) 
print("Random Forest Dev accuracy:", rfModel.score(X_dev, y_dev)) 
print("Random Forest Test accuracy:", rfModel.score(X_test, y_test))

print("Samples of processed test data set:")
display(df_test)

Data with just 3 classes for ratings has shape: (714030, 7)
Training data with hashed features has shape: (571224, 10)
Dev data with hashed features has shape: (71403, 10)
Test data with hashed features has shape: (71403, 10)
Decision Tree Training accuracy: 0.8733981765472039
Decision Tree Dev accuracy: 0.7074352618237328
Decision Tree Test accuracy: 0.7078974272789659
Random Forest Training accuracy: 0.8733771690265115
Random Forest Dev accuracy: 0.7247594638880719
Random Forest Test accuracy: 0.7249555340812011
Samples of processed test data set:


| | tconst | titleType | primaryTitle | averageRating | startYear | genres | directors | hashed_directors | hashed_genres | hashed_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 960800 | tt7590152 | tvEpisode | Episode #2.5 | 0 | 2017 | Reality-TV | \N | 75 | 47 | 35 |
| 581508 | tt1635830 | tvEpisode | Liars | 2 | 2011 | Comedy,Drama,Romance | nm0318795 | 60 | 33 | 35 |
| 654995 | tt2145570 | tvSeries | In Reverie | 2 | 2012 | Drama | nm3637034 | 28 | 71 | 55 |
| 457303 | tt10749672 | movie | A Planet in the Sea | 1 | 2019 | Documentary | nm3113346 | 21 | 61 | 56 |
| 767377 | tt3687564 | videoGame | Animaniacs | 2 | 1994 | Action | nm2120354 | 14 | 18 | 72 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 382948 | tt0825900 | tvEpisode | After the Fall | 2 | 1987 | Comedy | nm0694516 | 34 | 31 | 35 |
| 819159 | tt4659808 | movie | The Wolf's Lair | 2 | 2015 | Biography,Documentary,Family | nm0610041 | 12 | 56 | 56 |
| 234499 | tt0446975 | short | Les couilles de mon chat | 1 | 2005 | Action,Comedy,Short | nm0126875 | 55 | 38 | 14 |
| 620589 | tt1872194 | movie | The Judge | 2 | 2014 | Crime,Drama | nm0229694 | 85 | 34 | 56 |
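
The three `hashed_*` columns come from taking a SHA-1 digest of each string and reducing it modulo 100. The sketch below isolates that step in a helper (the name `hash_bucket` and the director IDs are hypothetical, introduced only for illustration) to show two properties of the encoding: it is deterministic, and with only 100 buckets many distinct values must share a bucket.

```python
import hashlib

def hash_bucket(value, n_buckets=100):
    """Map a value to a stable integer bucket in [0, n_buckets) using SHA-1."""
    digest = hashlib.sha1(str(value).encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical director IDs, made up for illustration.
director_ids = [f"nm{i:07d}" for i in range(1000)]
buckets = [hash_bucket(d) for d in director_ids]

# The same input always lands in the same bucket.
print("Deterministic:", hash_bucket("nm0000001") == hash_bucket("nm0000001"))

# 1000 distinct IDs cannot fit into 100 buckets without collisions.
print("Distinct IDs:", len(set(director_ids)))
print("Distinct buckets used:", len(set(buckets)))
```

This matches what the table above shows: every `movie` row has `hashed_type` 56 and every `tvEpisode` row has 35. The bucket numbers themselves are arbitrary, so a tree that splits on a threshold such as `hashed_directors <= 50` is grouping unrelated directors together.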



* In my Assignment 2 solution, I used a Random Forest model to predict film ratings, with fairly good results: about 72% accuracy on the test set. When I used a single Decision Tree, I got about 71% accuracy, but with a much shorter run time. (See the code above.)

* In the following Kaggle submission, the author used a Random Forest model to predict heart disease, getting 82% accuracy, whereas a single Decision Tree model only got 78% accuracy:
https://www.kaggle.com/faressayah/predicting-heart-disease-using-machine-learning#4.-Applying-machine-learning-algorithms

* In this Kaggle submission, the author used a Random Forest for 85% accuracy and also a simple Decision Tree for 84% accuracy to predict individuals' survival on the Titanic:
https://www.kaggle.com/ihelon/tree-randomforest-xgboost-lightgbm-catboost

* In another Kaggle submission, the author used a Random Forest model to predict happiness, again getting better accuracy than with a single Decision Tree. (This one is programmed in R):
https://www.kaggle.com/javadzabihi/happiness-2017-visualization-prediction
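
The comparisons above all follow the same pattern: fit a single Decision Tree and a Random Forest on the same training split, then score both on held-out data. Here is a minimal sketch of that pattern on synthetic data from `make_classification`. The data set, sizes, and variable names are assumptions chosen for illustration and are unrelated to any of the four data sets discussed, so the accuracies it prints say nothing about them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data; not related to any of the data sets above.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Fit both models on the same training split.
dt_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Score both on the same held-out split.
dt_acc = dt_demo.score(X_te, y_te)
rf_acc = rf_demo.score(X_te, y_te)
print(f"Decision Tree test accuracy: {dt_acc:.3f}")
print(f"Random Forest test accuracy: {rf_acc:.3f}")
```

Fixing `random_state` makes the run repeatable, which helps when the gap between the two models is only a point or two, as in several of the results listed above.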


So I decided to investigate Random Forests and Decision Trees further. I found a couple of Kaggle notebooks with great explanations. **Here's an excellent overview:**
https://www.kaggle.com/faressayah/decision-trees-and-random-forest-tutorial

And this one has **more technical detail and some great tree diagrams derived from the actual trees** that were generated by the code and data: https://www.kaggle.com/akashram/demystify-the-random-forest
<p>
This led me back to my Assignment 2 solution, where I added code to attempt similar visualizations. Instead, I ended up with some quantitative information about the trees that were generated. (See below.)
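
When a fitted tree is too large to draw in full, one possible workaround is to render only its top levels. The sketch below is an assumption on my part rather than what the notebook does: it fits an unpruned tree on small synthetic data (the feature names `f0` to `f3` are hypothetical) and uses scikit-learn's `export_text` with `max_depth` to print just the first few splits. `export_graphviz` accepts a `max_depth` argument as well.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic data set with 4 features, for illustration only.
X_small, y_small = make_classification(n_samples=500, n_features=4,
                                       n_informative=3, n_redundant=0,
                                       n_classes=3, n_clusters_per_class=1,
                                       random_state=0)

# An unpruned tree, as in the notebook: it grows until its leaves are pure.
small_tree = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)
print("Full tree depth:", small_tree.get_depth(),
      "leaves:", small_tree.get_n_leaves())

# Print only the top levels; deeper branches are summarized as truncated.
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
summary = export_text(small_tree, feature_names=feature_names, max_depth=2)
print(summary)
```

The top few splits are usually the most informative ones to read anyway, since they apply to the largest share of the training rows.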


# Here is my analysis of the trees used in my Assignment 2 solution above:
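
The analysis cell that follows uses `treeinterpreter` to split a tree's predicted probabilities into a bias term plus one contribution per feature. As a sanity check on that decomposition, this sketch plugs in the bias and contribution values the notebook prints for the single decision tree and confirms that they add up to the reported prediction. The numbers are copied from that printed output; nothing here re-runs the model.

```python
import numpy as np

# Values copied from the notebook's printed output for the single decision tree.
# Columns are the three rating classes (0 = bad, 1 = good, 2 = very good).
bias = np.array([[0.01753253, 0.34318061, 0.63928686]])

# One row per feature, in the order:
# hashed_type, startYear, hashed_genres, hashed_directors
contributions = np.array([[[ 0.01319772,  0.04378918, -0.05698689],
                           [-0.003097,    0.23638683, -0.23328984],
                           [-0.00949668, -0.00857333,  0.01807001],
                           [-0.01813657, -0.61478329,  0.63291986]]])

# Prediction = bias + sum of the per-feature contributions (summed over features).
reconstructed = bias + contributions.sum(axis=1)
print(reconstructed)
print("Matches [0, 0, 1]:", np.allclose(reconstructed, [[0.0, 0.0, 1.0]], atol=1e-6))
```

Read this way, the bias row is the class distribution at the root of the tree, and the last contribution row shows that `hashed_directors` is what moves this example from a 64% to a 100% probability of class 2.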

In [0]:
# The following code was inspired by https://www.kaggle.com/akashram/demystify-the-random-forest
# but I ran out of RAM when I tried to run it. The analysis below shows why:
#from io import StringIO
#from IPython.display import Image
#from sklearn.tree import export_graphviz
#import pydotplus
#dot_data = StringIO()
#export_graphviz(dtModel, out_file=dot_data,  
#                filled=True, rounded=True,
#                special_characters=True)
#graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
#Image(graph.create_png())

!pip install treeinterpreter
from treeinterpreter import treeinterpreter as ti

exampleIndex = 2 # 0

print()
print("Feature columns are:", featureColumns)
print("Looking at one row of features as an example:")
sampleRow = X_test.iloc[exampleIndex].to_numpy().reshape(1, -1)
print(sampleRow)
print("Its correct rating code (range is 0, 1, or 2) is:", y_test.iloc[exampleIndex])
print()
  
def analyzeTree(dt, description):
  ''' 
  function to print out quantitative info and a sample prediction
  for the decision tree dt, with its description
  '''
  print(description, "has a depth of:", dt.get_depth())
  print(description, "has:", dt.get_n_leaves(), "leaves.")
  print("For our example row shown at top...")
  print("The correct rating code (range is 0, 1, or 2) is:", y_test.iloc[exampleIndex])
  print("The tree's prediction is:", dt.predict(sampleRow))
  print("which is derived from the following probabilities from training data:", dt.predict_proba(sampleRow))
  prediction, bias, contributions = ti.predict(dt, sampleRow)
  print("Its prediction probabilities of", prediction, "are calculated by adding the bias of", bias)
  print("to the following contributions, along each column (one per feature):")
  print(contributions)
  print("Calculated prediction is:", bias + np.sum(contributions, axis=1))
  print(description, "has the following feature importances overall:", dt.feature_importances_)

analyzeTree(dtModel, "Decision tree (simple)")
print()
print("For the same example row shown at top, the Random Forest's overall prediction is:", rfModel.predict(sampleRow))
rfEstimators = rfModel.estimators_
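# scikit-learn's forest picks the class with the highest mean predicted probability across its trees
print("By averaging the class probabilities of its", len(rfEstimators), "trees. Here are the first 10:")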
print()
for count, treeFromForest in enumerate(rfEstimators[:10], start=1):
  analyzeTree(treeFromForest, "Sample tree #"+str(count)+" from forest")
  print()



Feature columns are: ['hashed_type', 'startYear', 'hashed_genres', 'hashed_directors']
Looking at one row of features as an example:
[[  55 2012   71   28]]
Its correct rating code (range is 0, 1, or 2) is: 2

Decision tree (simple) has a depth of: 45
Decision tree (simple) has: 129341 leaves.
For our example row shown at top...
The correct rating code (range is 0, 1, or 2) is: 2
The tree's prediction is: [2]
which is derived from the following probabilities from training data: [[0. 0. 1.]]
Its prediction probabilities of [[0. 0. 1.]] are calculated by adding the bias of [[0.01753253 0.34318061 0.63928686]]
to the following contributions, along each column (one per feature):
[[[ 0.01319772  0.04378918 -0.05698689]
  [-0.003097    0.23638683 -0.23328984]
  [-0.00949668 -0.00857333  0.01807001]
  [-0.01813657 -0.61478329  0.63291986]]]
Calculated prediction is: [[3.46944695e-18 0.00000000e+00 1.00000000e+00]]
Decision tree (simple) has the following feature importances overall: [0.19736