# Background

- **What is the idea behind this project?**

> This project aims to build a tool that verifies if legal patents are valid or not

> We determine if a patent is valid or not based on the Eligibility 101 Abstractness law.

> This was a supervised ML project

- **Why did I create this project?**

> I am in a club called SAAS (Student's Association for Applied Statistics) at Berkeley, on the Data Consulting team. We work with clients on statistical modeling, data science, ML, NLP, and CV projects. Last semester, my team and I worked with a client called AiLanthus, an insurance and litigation startup, to create the aforementioned tool.

> We built the tool because validating patents is a major time sink for examiners, and the high volume of patents means they are often not checked thoroughly enough. The business value of our model is to make the patent validation process more thorough, accurate, efficient, and timely.

> We all worked individually or in pairs on different phases of the deliverable. Before the phase I built, we downloaded data from the USPTO website and used 2 APIs (google__patent__scraper and Aylien) to scrape patent text and collect features such as examiner rates, forward and backward citations, the count of the word "said", and patent themes such as automobile, piracy, internet, etc.

> Our team did some research and decided that these were important features to collect because they helped to gauge patent validity, based on patents that were previously marked valid

- **What phase did I work on?**

> I used the features we engineered and created two layers of our multi-layer ML model.

- **What were the other phases of the project?**

> Other phases that I worked on with my teammates included feature engineering by vectorizing words using Law2Vec, EDA, PCA, statistical tests on the features to determine which ones are the most insightful, LDA, data balancing, forward feature subset selection, LSTMs, and combining all the pieces into a pipeline for our client to run locally

- **Why am I presenting this project to you?**

> This project was incredibly interesting and I feel that it is representative of the fields I'd like to explore at Stripe (ML) and my skill set!


- **What did I learn?**

> At a personal/technical level, this project was my first introduction to NLP. I learned about important packages for NLP-related work and sklearn, relevant EDA techniques, feature engineering, LDA, PCA, and creating decision trees, random forests, and LSTMs.

> This was not a class project so we had minimal direction. I learned how to quickly brainstorm ideas for every phase, create an ML model from scratch, investigate it deeply to find where to iterate on it, write code at a production-ready level, and combine our team's work into one package, all in 10 weeks!

> In terms of soft skills, I learned about how to take a challenging abstract idea and make it objective and programmable. I also learned how to present advanced technical work in a high-stakes setting to people with a nontechnical background so that they understand the business value of our work. Interns at Stripe would probably work in teams too so I pulled out this project to showcase my experience in programming with a team and how I learned to stitch our work together (ensemble method). 

- **What was the outcome of this work?**

> This project went into our final pipeline which was exciting, but to get to this phase I had put in a lot of time and effort into a previous project (feature engineering using Law2Vec) which got completely scrapped. Learning how to rebound from that and still make my efforts worthwhile was a crucial lesson! 

- **What work from here is transferable to SWE?**

> Web scraping and using APIs

> ML concepts and engineering

> Creating a pipeline that is production-ready and user-friendly

> Debugging and team programming

> Increasing the model's speed through hyperparameter tuning (reminds me of playing around with time complexity in different data structures)

> Writing clean code with documentation for our client and testing our final pipelines

> Using AWS to run and deploy our final models, as well as tracking our edits as primitive source control

## Setting Up

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import seaborn as sns
%matplotlib inline

In [7]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelBinarizer
from imblearn.over_sampling import SMOTE
import pydot

#### Key packages were:
- numpy, pandas, matplotlib, seaborn: the necessities
- sklearn: to create, tune, and evaluate the models
- imblearn: for data balancing (SMOTE)

#### Other packages not featured here, but used in other phases
- NLTK, Gensim: textual data cleaning and word vectorization using Law2Vec

In [8]:
# I read in the dataset with the numerical features from the USPTO website 
# and the textual features (i.e. patent text) from google__patent__scraper API
full_merged = pd.read_csv('fullMerged.csv')
full_merged = full_merged.drop(columns=["Eligiblity_under_§_103"])
full_merged.head()

Unnamed: 0,Case_name,Case_number,Court,Patent/Application,Claims,Eligibility_under_§_101_(abstract),Date,Rationale,Fenwick_Coding,1st Patent Eligibility 101,...,Digit/Decimal Count,Excluding Phrase Count,Semicolon Count,Number of Backward Citations,Number of Forward Citations,Wherein Count,Said Count,Examiner Name,Examiner Rate,Examiner Count
0,"ULTRAMERCIAL, INC., AND ULTRAMERCIAL, LLC,",2010-1544,Court of Appeals,7346545,"ULTRAMERCIAL, INC., AND ULTRAMERCIAL, LLC,",0.0,10/14/2014,Abstract,Abstract Idea: An Idea of Itself,No,...,189,332,28,,859412202.0,1.0,18.0,ROBERT POND,0.849624,2660.0
1,"Hikansut LLC, v. United States",COFC-1-12-cv-00303,In the United States Court of Federal Claims,7175722,"Hikansut LLC, v. United States",1.0,10/18/2016,Not Abstract,Not Abstract,Yes,...,481,729,13,,975444364.0,36.0,0.0,SIKYIN IP,0.8839,7149.0
2,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",12-1118-GMS-SRF,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,7418409,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",0.0,03/03/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,599,1247,43,,806557070.0,2.0,2.0,WILLIAM ALLEN,0.86118,2406.0
3,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",12-1118-GMS-SRF,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,8145536,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",0.0,03/03/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,3488,11766,188,,,48.0,810.0,WILLIAM ALLEN,0.86118,2406.0
4,"PricePlay.com, v. AOL Advertising, Inc.",14-92-RGA,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,8050982,"PricePlay.com, v. AOL Advertising, Inc.",0.0,03/10/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,212,485,42,,,7.0,9.0,MICHAEL MISIASZEK,0.83942,1862.0


In [9]:
# I also read in the other dataset we have with the patent concepts which we got from calling the 
# Aylien API on our patent text
concepts_types = pd.read_csv("conceptsAndTypes.csv")
concepts_types = concepts_types.drop(columns=["Unnamed: 0", "Unnamed: 0.1", "Eligiblity_under_§_103"])
concepts_types.head()

Unnamed: 0,Case_name,Case_number,Court,Patent/Application,Claims,Eligibility_under_§_101_(abstract),Date,Rationale,Fenwick_Coding,1st Patent Eligibility 101,...,Type: Model,Type: Change,Type: Kind,Type: Eukaryote,Type: Transmission,Type: Village,Type: Element,Type: Cartoon,Type: Anime,Type: Activity
0,"ULTRAMERCIAL, INC., AND ULTRAMERCIAL, LLC,",2010-1544,Court of Appeals,7346545,"ULTRAMERCIAL, INC., AND ULTRAMERCIAL, LLC,",0.0,10/14/2014,Abstract,Abstract Idea: An Idea of Itself,No,...,0,0,0,0,0,0,0,0,0,0
1,"Hikansut LLC, v. United States",COFC-1-12-cv-00303,In the United States Court of Federal Claims,7175722,"Hikansut LLC, v. United States",1.0,10/18/2016,Not Abstract,Not Abstract,Yes,...,0,0,0,0,0,0,0,0,0,0
2,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",12-1118-GMS-SRF,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,7418409,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",0.0,03/03/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,0,0,0,0,0,0,0,0,0,0
3,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",12-1118-GMS-SRF,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,8145536,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",0.0,03/03/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,0,0,0,0,0,0,0,0,0,0
4,"PricePlay.com, v. AOL Advertising, Inc.",14-92-RGA,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,8050982,"PricePlay.com, v. AOL Advertising, Inc.",0.0,03/10/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# I then concatenate the two datasets to have a final dataset that contains 
# all the numerical, textual, and thematic features
final = pd.concat([full_merged, concepts_types], axis=1)
final.head()

Unnamed: 0,Case_name,Case_number,Court,Patent/Application,Claims,Eligibility_under_§_101_(abstract),Date,Rationale,Fenwick_Coding,1st Patent Eligibility 101,...,Type: Model,Type: Change,Type: Kind,Type: Eukaryote,Type: Transmission,Type: Village,Type: Element,Type: Cartoon,Type: Anime,Type: Activity
0,"ULTRAMERCIAL, INC., AND ULTRAMERCIAL, LLC,",2010-1544,Court of Appeals,7346545,"ULTRAMERCIAL, INC., AND ULTRAMERCIAL, LLC,",0.0,10/14/2014,Abstract,Abstract Idea: An Idea of Itself,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Hikansut LLC, v. United States",COFC-1-12-cv-00303,In the United States Court of Federal Claims,7175722,"Hikansut LLC, v. United States",1.0,10/18/2016,Not Abstract,Not Abstract,Yes,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",12-1118-GMS-SRF,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,7418409,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",0.0,03/03/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",12-1118-GMS-SRF,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,8145536,"Tenon & Groove, LLC, v. Plusgrade S.E.C.",0.0,03/03/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"PricePlay.com, v. AOL Advertising, Inc.",14-92-RGA,IN THE UNITED STATES DISTRICT COURT\n FOR THE ...,8050982,"PricePlay.com, v. AOL Advertising, Inc.",0.0,03/10/2015,Abstract,Abstract Idea: Implemented on Generic Computer,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# Here, I am printing out all the columns so that I know which ones are duplicates to get rid of 
for col in final.columns:
    print (col)

Case_name
Case_number
Court
Patent/Application
Claims
Eligibility_under_§_101_(abstract)
Date
Rationale
Fenwick_Coding
1st Patent Eligibility 101
1st Patent Algorithm
Test Result
Invalid False Positive
Invalid True Positive
Invalid False Negative
Invalid True Negative
Valid False Positive
Valid True Positive
Valid False Negative
Kind Code
Party 1
Party 2
w_US
Heading Text
Number of headings
DETAILED in heading
Patent Text
Number of Paragraphs
Figure Count
Digit/Decimal Count
Excluding Phrase Count
Semicolon Count
Number of Backward Citations
Number of Forward Citations
Wherein Count
Said Count
Examiner Name
Examiner Rate
Examiner Count
Case_name
Case_number
Court
Patent/Application
Claims
Eligibility_under_§_101_(abstract)
Date
Rationale
Fenwick_Coding
1st Patent Eligibility 101
1st Patent Algorithm
Test Result
Invalid False Positive
Invalid True Positive
Invalid False Negative
Invalid True Negative
Valid False Positive
Valid True Positive
Valid False Negative
Kind Code
Party 1
Party 2

Source: https://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns

In [12]:
# Now, I am eliminating those duplicate columns
final = final.loc[:,~final.columns.duplicated()]
for col in final.columns:
    print (col)

Case_name
Case_number
Court
Patent/Application
Claims
Eligibility_under_§_101_(abstract)
Date
Rationale
Fenwick_Coding
1st Patent Eligibility 101
1st Patent Algorithm
Test Result
Invalid False Positive
Invalid True Positive
Invalid False Negative
Invalid True Negative
Valid False Positive
Valid True Positive
Valid False Negative
Kind Code
Party 1
Party 2
w_US
Heading Text
Number of headings
DETAILED in heading
Patent Text
Number of Paragraphs
Figure Count
Digit/Decimal Count
Excluding Phrase Count
Semicolon Count
Number of Backward Citations
Number of Forward Citations
Wherein Count
Said Count
Examiner Name
Examiner Rate
Examiner Count
concepts
Concept: Piracy
Concept: Internet
Concept: Credit_card
Concept: Patent
Concept: Compact_disc
Concept: Cable_television
Concept: Debit_card
Concept: Flowchart
Concept: Member_of_parliament
Concept: Bank
Concept: Demography
Concept: Multimedia
Concept: Telecommunications_network
Concept: Music_industry
Concept: Digital_terrestrial_television
Conce
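A minimal sketch of what the concatenate-then-deduplicate step does, on two toy frames. The column values here are made up for illustration; only the column-name overlap mirrors the real files.

```python
import pandas as pd

# Two toy frames that share the "Case_name" column, mimicking the overlap
# between the two CSV files (the values are invented for this example).
left = pd.DataFrame({"Case_name": ["A", "B"], "Said Count": [18, 0]})
right = pd.DataFrame({"Case_name": ["A", "B"], "Concept: Internet": [1, 0]})

# Side-by-side concatenation keeps both copies of "Case_name".
merged = pd.concat([left, right], axis=1)

# columns.duplicated() marks every repeat of a name after its first occurrence.
mask = merged.columns.duplicated()

# Selecting with the inverted mask keeps only the first copy of each column.
deduped = merged.loc[:, ~mask]
print(list(deduped.columns))
```

Note that this keeps whichever copy appears first, so it is only safe when the duplicated columns hold the same values in both frames.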

In [13]:
# I drop rows where the class is null (i.e. where we have no verdict for the patent)
final.dropna(axis=0, subset=['Eligibility_under_§_101_(abstract)'], inplace=True)

In [14]:
# I fill feature columns with -10 if null so as not to majorly affect the training and testing errors as well as 
# model evaluation metrics
final.fillna(-10, inplace=True)

In [15]:
# I choose X as the features and Y as the pre-determined eligibility from our client's dataset 
X = final.drop(columns = ["Case_name", "Case_number", "Court", "Patent/Application", "Claims",
                                      "Eligibility_under_§_101_(abstract)", "Date", "Rationale", "Fenwick_Coding",
                                      "1st Patent Eligibility 101", "1st Patent Algorithm", "Test Result",
                                      "Invalid False Positive", "Invalid True Positive", "Invalid False Negative",
                                      "Invalid True Negative", "Valid False Positive", "Valid True Positive",
                                      "Valid False Negative", "Kind Code", "Party 1", "Party 2", "w_US", 
                                      "Heading Text", "Examiner Name", "Patent Text", "concepts"])
Y = final['Eligibility_under_§_101_(abstract)']

In [16]:
# I set aside 80% of the data as the training set and the remaining 20% as the test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
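Because the classes are imbalanced, one option worth sketching is a stratified split, which keeps the class ratio roughly equal in the training and test sets. This was not part of the original notebook; the data below is synthetic and the 26% positive rate is an assumed stand-in.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patent features: 100 rows, 26 positives (assumed ratio).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 26 + [0] * 74)

# stratify=y_demo asks sklearn to preserve the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The share of positives should be close to 0.26 in both splits.
print(y_tr.mean(), y_te.mean())
```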

## Optimizing RF - Hyperparameter Tuning

Source: RWL2V.ipynb on AWS server

#### Background
- A random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. When splitting a node, each tree searches for the best feature (by an impurity measure such as Gini) among a random subset of the features rather than among all of them
- It was useful because after collecting all of our features, we were able to see which ones were the most important and how many we needed to maximize our evaluation metrics
- This worked for my purposes: dropping unimportant features, reducing variance, and generalizing to new datasets better than DTs
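- While I did use a Decision Tree as well, I didn't try a logistic regression. At the time I thought of it as producing a numerical value rather than a categorical/discrete output, although it is in fact a classifier (it thresholds a predicted probability) and could have served as a baseline.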
- A disadvantage of RFs can be the time it takes to run them, but I didn't encounter that problem here
- What is the difference b/w a DT and RF? A DT comes up with 1 way to predict the class of a label whereas an RF trains several DTs, each on a different random sample of the data and features, and then combines their votes for a better result
- Some issues I had during hyperparameter tuning were deciding how big the parameters should be without getting too much variance; I troubleshot this with my PM
- I learned about implementing a RF through teammates' skeleton code, towardsdatascience, medium, and other internet articles
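The DT versus RF comparison and the feature-importance idea above can be sketched on synthetic data. Everything below (the dataset, the variable names, the choice of `max_features="sqrt"`) is an assumption for illustration, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with 20 features, 5 of them informative.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# One tree: every split may look at all 20 features.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_a, y_a)

# Forest: 100 trees, each split looks at a random subset of sqrt(20) features.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(X_a, y_a)

tree_acc = single_tree.score(X_b, y_b)
forest_acc = forest.score(X_b, y_b)

# Indices of the five features the forest found most important.
top_features = forest.feature_importances_.argsort()[::-1][:5]
print(tree_acc, forest_acc, top_features)
```

`feature_importances_` is what makes it possible to rank features and drop the unimportant ones, as described in the bullets above.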

#### Procedure
Training a base model to begin tuning

In [17]:
# Instantiate model
rf = RandomForestClassifier(n_estimators=5, random_state=42)

# Train the model on training data
rf.fit(X_train, y_train);

In [18]:
# Training accuracy (1 - training error) - how well the model classifies the training dataset it was trained on
y_train_pred = rf.predict(X_train)
train_acc = rf.score(X_train, y_train)
train_acc

0.9620493358633776

In [19]:
# Testing accuracy (1 - testing error) - how well the RF classifies new data it has never seen before (test data)
y_pred = rf.predict(X_test)
test_acc = rf.score(X_test, y_test)
test_acc

0.8181818181818182

#### Observations
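There is a gap between training accuracy (0.96) and test accuracy (0.82), though I did not consider it severe overfitting. Our dataset is also imbalanced (more invalid patents than valid), so I focused the next section on hyperparameter tuning to see if we could mitigate this issue slightly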

#### A note on overfitting
- Overfitting occurs when the model fits the training data too well, meaning it won't generalize to new data very well
- We observe it numerically when the test error is substantially greater than the train error

#### Tuning depths

- First I looked at the depths because increasing tree depth should theoretically increase the performance on the training set
- One thing I had to be careful about was that deeper trees are more likely to overfit, so accuracy on a new dataset may drop even as training accuracy rises.
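A small sketch of the depth tradeoff described above, using a single decision tree on synthetic, deliberately noisy data (`flip_y=0.1` randomly flips 10% of labels). The dataset and depth values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so that a fully grown tree has something to overfit.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# For each depth, record (train accuracy, test accuracy, gap between them).
gaps = {}
for depth in [1, 3, 6, None]:  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_a, y_a)
    train_acc = model.score(X_a, y_a)
    test_acc = model.score(X_b, y_b)
    gaps[depth] = (train_acc, test_acc, train_acc - test_acc)

for depth, (tr, te, gap) in gaps.items():
    print(depth, round(tr, 3), round(te, 3), round(gap, 3))
```

The unrestricted tree memorizes the training set, so the train/test gap is the number to watch when picking a depth.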

#### A note on cross validation
- Cross validation is used to understand how well our model will generalize when it sees new data
- K-fold cross validation lets us avoid compromising on the amount of data we have for training our model and for validating it. We divide the data into k subsets, hold out one of the k subsets as the validation set, and train on the remaining k - 1 subsets. We repeat this k times so that each subset is held out once, and average the scores over all k trials to estimate the accuracy of our model.
- It is useful because this is a common approach for validation. It reduces bias because we use most of the data for fitting the model, and it reduces the variance of the estimate because every data point is eventually used for validation
- This worked for my purposes to mitigate over-fitting (which we didn't actually see here) and is good practice when hyperparameter tuning
- Other things I could've tried are the holdout method (a single train/validation split, not exhaustive enough for the technical requirements of this project) or stratified k-fold (useful when there is an imbalance in the dataset, so that each fold contains the same percentage of valid and invalid patents)
- Some issues I had were understanding what bias and variance are, and for this I chatted with a teammate who explained the concepts to me and how they could be observed in our project
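The k-fold procedure in the bullets above can be written out by hand and compared with `cross_val_score`. This is a sketch on synthetic data with assumed settings (5 shuffled stratified folds, 50 trees). As far as I know, `cross_val_score` with an integer `cv` already uses stratified folds for classifiers, so the explicit `StratifiedKFold` here mainly makes that visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data (roughly 74% / 26%), an assumed stand-in for the patents.
X_cv, y_cv = make_classification(
    n_samples=300, n_features=10, weights=[0.74, 0.26], random_state=0
)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: train on k-1 folds, score on the held-out fold, repeat k times.
fold_scores = []
for train_idx, val_idx in skf.split(X_cv, y_cv):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))
manual_mean = np.mean(fold_scores)

# The helper does the same thing when given the same splitter and estimator.
helper_mean = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X_cv, y_cv, cv=skf
).mean()
print(manual_mean, helper_mean)
```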

#### A note on bias and variance
- Bias is an error made by the model when it makes incorrect assumptions in its classification algorithm. High bias can cause an algorithm to omit key relationships between the features and its output (i.e. underfitting)
- Variance is the variability in the model's output for a certain datapoint fed in. High variance can mean the model is poor at generalizing (i.e. overfitting)
- High bias, low variance: model is consistent but inaccurate on average
- High bias, high variance: model is inaccurate and also inconsistent on average
- Low bias, high variance: model is accurate on average but inconsistent
- Low bias, low variance: model is accurate and consistent on average

In [17]:
# I look at the resulting accuracy, precision, and recall scores for varying depths
# and create a list to store the scores for each depth which can then be added to a dataframe at the end
accuracy = []
precision = []
recall = []


# I iterate through depths 2 through 20 (np.arange excludes 21), and build a dataframe of the mean 5-fold
# cross val accuracy, precision, and recall of the RF at each depth
depths = np.arange(2, 21)
for depth in depths:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth)
    acc = cross_val_score(rf, X_train, y_train, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(rf, X_train, y_train, cv=5, scoring='precision')
    precision.append(np.mean(pre))
    rec = cross_val_score(rf, X_train, y_train, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": precision, "recall" : recall}, index = depths)

  _warn_prf(average, modifier, msg_start, len(result))


max_depth,accuracy,precision,recall
2,0.738149,0.0,0.0
3,0.743845,0.6,0.014815
4,0.76469,1.0,0.079365
5,0.770386,0.96,0.130688
6,0.774178,0.909524,0.166931
7,0.791285,0.808413,0.210053
8,0.787493,0.824286,0.239418
9,0.789344,0.783333,0.239153
10,0.789326,0.75798,0.289683
11,0.787457,0.707273,0.304233


#### Observation 
Since our focus is precision and recall, the key pattern is that precision drops as depth grows while recall increases. After much investigation with the depths, the max recall score is around 0.36 and the best value for depth is 18
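The depth loop above runs three separate cross-validations per depth. A possible alternative, sketched here on synthetic data, is `cross_validate`, which computes several metrics from the same fitted folds. The dataset, the three depths, and the fixed `random_state` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, weights=[0.74, 0.26], random_state=0
)
metrics_wanted = ["accuracy", "precision", "recall"]

rows = {}
for depth in [2, 4, 8]:
    rf_demo = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    # One call fits each fold once and scores it with all three metrics.
    cv_out = cross_validate(rf_demo, X_syn, y_syn, cv=5, scoring=metrics_wanted)
    rows[depth] = {m: np.mean(cv_out[f"test_{m}"]) for m in metrics_wanted}

# One row per depth, one column per metric.
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

Besides being roughly three times cheaper, this guarantees that the accuracy, precision, and recall in one row come from the same fitted models.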

#### Tuning max_features
- After tuning depths, I also tuned max_features, the maximum number of features the RF will consider when deciding how to split a node (i.e. in each sub-DT it creates)
- One thing to be cautious of is choosing a value for this parameter that doesn't sacrifice model speed or the diversity/value of findings in each subtree

In [18]:
# I look at the resulting accuracy, precision, and recall scores for a varying number of features now
# and create a list to store the scores for each feature count which can then be added to a dataframe at the end
accuracy = []
precision = []
recall = []

# I iterate through feature counts 15 through 45 inclusive (5 at a time), and build a dataframe of the mean
# 5-fold cross val accuracy, precision, and recall of the RF at each feature count
features = [15, 20, 25, 30, 35, 40, 45]
for feat in features:
    rf = RandomForestClassifier(n_estimators=100, max_features=feat)
    acc = cross_val_score(rf, X_train, y_train, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(rf, X_train, y_train, cv=5, scoring='precision')
    precision.append(np.mean(pre))
    rec = cross_val_score(rf, X_train, y_train, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": precision, "recall" : recall}, index = features)

max_features,accuracy,precision,recall
15,0.795022,0.71317,0.362963
20,0.787439,0.73228,0.333333
25,0.791267,0.68663,0.348148
30,0.79504,0.693712,0.362698
35,0.787457,0.663553,0.355291
40,0.787457,0.667587,0.384392
45,0.789416,0.66201,0.384656


#### Observation
Precision and recall fluctuate, but the number of features seems to affect recall more than depth does. The best value for max_features might be 15, as we don't want to encounter high variance with an even larger parameter value

#### Final step
- To see how different values of depth and max_features affect accuracy, precision, and recall, these next few cells result in a dataframe for each depth and feature number together as well as the corresponding average evaluation metric score
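A joint search over depth and max_features like the one described here can also be expressed with `GridSearchCV`. This is a sketch under assumptions: synthetic data, a much smaller grid than the real one, and refitting on recall, which is my choice for the example rather than something the project specified.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, weights=[0.74, 0.26], random_state=0
)

# Small illustrative grid: 3 depths x 3 feature counts = 9 combinations.
param_grid = {"max_depth": [4, 8, 12], "max_features": [5, 10, 15]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring=["accuracy", "precision", "recall"],
    refit="recall",  # with several metrics, refit names the one used to pick the best model
    cv=5,
)
search.fit(X_syn, y_syn)

# One row per (depth, max_features) pair with the mean score of each metric.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_max_features",
     "mean_test_accuracy", "mean_test_precision", "mean_test_recall"]
]
print(search.best_params_)
print(results)
```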

In [20]:
# Repeating the same process as above, this time iterating through different depths and feature numbers
accuracy = []
precision = []
recall = []
scores = []

depths = np.arange(2, 21)
features = [15, 20, 25, 30, 35, 40, 45]
for depth in depths:
    for feat in features:
        clf = RandomForestClassifier(n_estimators=100, max_depth=depth,  max_features=feat)
        acc = cross_val_score(clf, X_train, y_train, cv=5)
        pre = cross_val_score(clf, X_train, y_train, cv=5, scoring='precision')
        rec = cross_val_score(clf, X_train, y_train, cv=5, scoring='recall')
        score = [depth, feat, acc, pre, rec]
        scores.append(score)

  _warn_prf(average, modifier, msg_start, len(result))

In [21]:
s = pd.DataFrame(data = scores, columns = ["Depth", "Features", "Accuracy", "Precision", "Recall"])

In [22]:
# I flag every row where at least one of the 5 cross val folds has a precision of 0, so that we can eliminate it
# because it won't be a very useful data point for classifying patents (it does not tell us anything about 
# patent validity)
i = [bool(np.any(fold_scores == 0.0)) for fold_scores in s.Precision.values]
s[i]

Unnamed: 0,Depth,Features,Accuracy,Precision,Recall
0,2,15,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
1,2,20,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
2,2,25,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 0.03571428571428571, 0.0, 0.0, 0.0]"
3,2,30,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.07142857142857142, 0.0, 0.037037037037..."
4,2,35,"[0.7358490566037735, 0.7547169811320755, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.037037037037037035, 0.0]"
5,2,40,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 1.0, 1.0, 0.0, 0.0]","[0.0, 0.07142857142857142, 0.0, 0.0, 0.0]"
6,2,45,"[0.7452830188679245, 0.7452830188679245, 0.733...","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 0.03571428571428571, 0.0, 0.0, 0.0]"
7,3,15,"[0.7452830188679245, 0.7547169811320755, 0.742...","[0.0, 0.0, 1.0, 1.0, 0.0]","[0.0, 0.07142857142857142, 0.0, 0.037037037037..."
8,3,20,"[0.7452830188679245, 0.7735849056603774, 0.752...","[0.0, 1.0, 1.0, 0.0, 0.0]","[0.03571428571428571, 0.10714285714285714, 0.0..."
9,3,25,"[0.7547169811320755, 0.7547169811320755, 0.752...","[1.0, 1.0, 1.0, 1.0, 0.0]","[0.0, 0.07142857142857142, 0.07142857142857142..."


In [23]:
# I create the final dataframe for each depth and feature and the corresponding average score for acc, pre, and rec
# This dataframe has gotten rid of the above 14 rows with a 0 in the precision score
s = s[~np.array(i)].copy()  # copy so the assignments below don't write into a slice of the original
s["Accuracy"] = s.Accuracy.apply(np.mean)
s["Precision"] = s.Precision.apply(np.mean)
s["Recall"] = s.Recall.apply(np.mean)
s

Unnamed: 0,Depth,Features,Accuracy,Precision,Recall
14,4,15,0.760916,0.900000,0.093915
15,4,20,0.768500,1.000000,0.086772
16,4,25,0.766595,0.960000,0.108995
17,4,30,0.764690,0.893333,0.123280
18,4,35,0.764690,0.900000,0.123545
...,...,...,...,...,...
128,20,25,0.795058,0.652080,0.377249
129,20,30,0.796981,0.698254,0.369841
130,20,35,0.781761,0.681078,0.341005
131,20,40,0.795040,0.701989,0.391799


#### Observation
After a cursory look, it seems the best depth = 19 and the best number of features = 40

#### Procedure
Training a new model with the new "best" hyperparameters

In [26]:
# Instantiate optimized model
rf_optimized = RandomForestClassifier(n_estimators=10, max_depth=19,  max_features=40)

# Train the model on training data
rf_optimized.fit(X_train, y_train);

In [27]:
# Training error
y_train_pred = rf_optimized.predict(X_train)
train_acc = rf_optimized.score(X_train, y_train)
train_acc

0.9715370018975332

In [28]:
# Testing error
y_pred2 = rf_optimized.predict(X_test)
test_acc = rf_optimized.score(X_test, y_test)
test_acc

0.8484848484848485

In [29]:
# Accuracy - correctly predicted observations / total observations
acc = accuracy_score(y_test, y_pred2)
acc

0.8484848484848485

In [30]:
# Precision - true positives / all predicted positives
# Higher precision means that a model returns more relevant results than irrelevant ones 
prec = precision_score(y_test, y_pred2)
prec

0.8666666666666667

In [31]:
# Recall - true positives / all actual positives (almost like a measure of quantity)
# High recall means that a model returns most of the relevant results (whether or not irrelevant ones 
# are also returned)
recall = recall_score(y_test, y_pred2)
recall

0.41935483870967744
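The three scores above can be reproduced by hand from a confusion matrix. The counts below are not printed in the notebook; they are inferred as integers consistent with the reported accuracy, precision, and recall on a 132-row test set, so treat them as an assumption.

```python
# Inferred (not reported) confusion-matrix counts for the optimized RF on the test set.
tp, fp, fn, tn = 13, 2, 18, 99  # true pos, false pos, false neg, true neg

accuracy_by_hand = (tp + tn) / (tp + fp + fn + tn)   # correct / all
precision_by_hand = tp / (tp + fp)                   # correct positives / predicted positives
recall_by_hand = tp / (tp + fn)                      # correct positives / actual positives
f1_by_hand = 2 * precision_by_hand * recall_by_hand / (precision_by_hand + recall_by_hand)

print(accuracy_by_hand, precision_by_hand, recall_by_hand, f1_by_hand)
```

Read this way, the model predicts the positive class for only 15 of 132 patents while 31 are actually positive. It is rarely wrong when it does predict positive (high precision), but it misses most of the positives (low recall).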

#### Observations
The recall is quite low while accuracy and precision are doing well. Something I struggled to understand in this phase was why this was happening, and I spoke to my PM and teammates about this issue.

## Optimizing DT - Hyperparameter Tuning

#### Background
- A decision tree is the building block of a random forest. A single tree considers all of the input features at every split and picks the one that best separates the classes by a measure such as Gini impurity, whereas a random forest trains many trees, each restricted to a random subset of features at each split, and combines their predictions into the final result.
- This was useful to get a baseline for how a tree structure would suit our dataset and how effective it'd be in classifying patents. 
- This worked for my purposes because I did not perform any initial feature selection and wanted to try out all the features first, DTs are quite interpretable compared to a random forest, and they perform well on the training set
- After trying a RF and DT, I didn't investigate with a KNN classifier, SVM, or boosted or bagged DTs
- Disadvantages of DTs are the high likelihood of overfitting and high variance when the tree is deep, and high bias when it is too shallow
- Some issues I had were understanding the tradeoffs between RFs and DTs, and for this I spoke to my teammates and browsed articles on the web. I also spoke to my PM who ultimately informed me that trying both models would be beneficial.
- I learned about implementing a DT through teammates' skeleton code, towardsdatascience, medium, and other internet articles
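To illustrate the interpretability point above, a fitted tree's decision rules can be printed as plain text with `export_text`. The data and the `feature_*` names below are hypothetical placeholders, not the project's features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X_syn, y_syn = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical names

# A shallow tree keeps the printed rules short enough to read.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_syn, y_syn)

# Each line is a threshold test on one feature, ending in a predicted class.
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

A random forest has no single rule list like this, which is the interpretability tradeoff mentioned in the bullets.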

### NOTE
I actually started with a decision tree and then created the random forest!

#### Procedure
Training a base model to begin tuning

In [33]:
# Instantiate model
clf = tree.DecisionTreeClassifier() 

# Fit it on our data
clf = clf.fit(X_train, y_train)

In [34]:
# Training error
y_train_pred = clf.predict(X_train) 
train_acc = clf.score(X_train, y_train) 
train_acc

0.9810246679316889

In [35]:
# Testing error
y_pred = clf.predict(X_test)
test_acc = clf.score(X_test, y_test)
test_acc

0.7878787878787878

#### Observations
Just like above, there is a gap between training accuracy (0.98) and test accuracy (0.79) that I did not consider severe overfitting, but I tuned the parameters of the decision tree anyway in hopes of getting solid model metrics

#### Tuning depth
- The procedure is the exact same as above! I followed through with the same steps since random forests and decision trees are quite similar in structure

In [36]:
# Same as above
accuracy = []
precision = []
recall = []

# Same as above
depths = np.arange(2, 21)
for depth in depths:
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    acc = cross_val_score(clf, X_train, y_train, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(clf, X_train, y_train, cv=5, scoring='precision')
    precision.append(np.mean(pre))
    rec = cross_val_score(clf, X_train, y_train, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": precision, "recall" : recall}, index = depths)

max_depth,accuracy,precision,recall
2,0.757161,0.738889,0.138095
3,0.745714,0.556825,0.225132
4,0.743845,0.544444,0.261376
5,0.759084,0.622619,0.31164
6,0.747673,0.547849,0.420635
7,0.728661,0.50764,0.37672
8,0.740054,0.499396,0.419577
9,0.73434,0.513741,0.462963
10,0.75513,0.53295,0.520899
11,0.751339,0.535043,0.506614


#### Observations
Recall is much higher than in the random forest and mostly rises with depth. Precision is highest at depth 2 and then fluctuates between roughly 0.5 and 0.62 in the rows shown. It seems that the best depth could be 20

#### Tuning max_features
This is the exact same process as above!

In [37]:
# Same as above
accuracy = []
presision = []
recall = []

# Same as above, but now I start at 5 to see if we get good metrics with a more "ideal" number of features
features = [5, 10, 15, 20, 25, 30, 35, 40]
for feat in features:
    clf = tree.DecisionTreeClassifier(max_features=feat)
    acc = cross_val_score(clf, X_train, y_train, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(clf, X_train, y_train, cv=5, scoring='precision')
    presision.append(np.mean(pre))
    rec = cross_val_score(clf, X_train, y_train, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": presision, "recall" : recall}, index = features)

max_features,accuracy,precision,recall
5,0.734214,0.478811,0.449206
10,0.768518,0.514848,0.492593
15,0.728607,0.454388,0.478307
20,0.745678,0.517714,0.507407
25,0.730476,0.511475,0.471429
30,0.732507,0.501881,0.499206
35,0.732489,0.560599,0.520899
40,0.732399,0.503307,0.485714


#### Observations
The recall scores are again much higher than in the random forest, but this time they rise and fall rather than trending upward as they did with depth. The best max_features could be 35, although this might be slightly higher than ideal

In [38]:
# Same as above
accuracy = []
presision = []
recall = []
scores = []

# Same as above
depths = np.arange(2, 21)
features = [5, 10, 15, 20, 25, 30, 35, 40]
for depth in depths:
    for feat in features:
        clf = RandomForestClassifier(max_depth=depth,  max_features=feat)
        acc = cross_val_score(clf, X_train, y_train, cv=5)
        pre = cross_val_score(clf, X_train, y_train, cv=5, scoring='precision')
        rec = cross_val_score(clf, X_train, y_train, cv=5, scoring='recall')
        score = [depth, feat, acc, pre, rec]
        scores.append(score)

  _warn_prf(average, modifier, msg_start, len(result))

In [39]:
t = pd.DataFrame(data = scores, columns = ["Depth", "Features", "Accuracy", "Precision", "Recall"])

In [40]:
# Same as above: flag parameter pairs whose precision was 0.0 in at least one CV fold
i = [bool(sum(i == [0.0, 0.0, 0.0, 0.0, 0.0])) for i in t.Precision.values]
t[i]

index,Depth,Features,Accuracy,Precision,Recall
0,2,5,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
1,2,10,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
2,2,15,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
3,2,20,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
4,2,25,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
5,2,30,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.03571428571428571, 0.0, 0.0, 0.0, 0.0]"
6,2,35,"[0.7358490566037735, 0.7452830188679245, 0.733...","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
7,2,40,"[0.7358490566037735, 0.7358490566037735, 0.742...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.03571428571428571, 0.0, 0.0, 0.0]"
8,3,5,"[0.7358490566037735, 0.7358490566037735, 0.733...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
9,3,10,"[0.7358490566037735, 0.7358490566037735, 0.742...","[0.0, 0.0, 1.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]"


In [41]:
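# Same as above: drop the flagged rows (copying avoids pandas' SettingWithCopyWarning), then average each metric over the folds
t = t[~np.array(i)].copy()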
t["Accuracy"] = t.Accuracy.apply(np.mean)
t["Precision"] = t.Precision.apply(np.mean)
t["Recall"] = t.Recall.apply(np.mean)
t

index,Depth,Features,Accuracy,Precision,Recall
15,3,40,0.760898,0.810000,0.101058
18,4,15,0.762839,1.000000,0.079630
19,4,20,0.762803,0.950000,0.123545
20,4,25,0.766595,0.960000,0.130688
21,4,30,0.770386,0.966667,0.137831
...,...,...,...,...,...
147,20,20,0.789380,0.709919,0.355556
148,20,25,0.793172,0.690119,0.363228
149,20,30,0.796981,0.705016,0.376720
150,20,35,0.795040,0.644725,0.362434


#### Observations
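After a cursory look, recall is still suffering despite some success when tuning depth and max_features individually. This was something I had trouble understanding and worked with my teammates to tackle. One likely cause is that the grid search cell above instantiates RandomForestClassifier rather than tree.DecisionTreeClassifier, so these scores describe a random forest rather than a single tree. We could choose the best depth = 18 and the best number of features = 15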
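
The nested loops over depth and max_features could also be written with scikit-learn's `GridSearchCV`, which scores several metrics in one pass and collects the fold averages for you. This is only a sketch under assumptions: it uses synthetic data and a smaller grid than the notebook, and the names `X_demo`, `param_grid`, and `results` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the patent dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=40, n_informative=8,
    weights=[0.75, 0.25], random_state=0,
)

# Smaller grid than the notebook's, to keep the sketch fast
param_grid = {"max_depth": [2, 6, 10], "max_features": [5, 15, 25]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring=["accuracy", "precision", "recall"],
    refit="recall",  # with several metrics, refit must name the one used to pick the best model
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per (max_depth, max_features) pair, already averaged over the folds
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_max_features",
     "mean_test_accuracy", "mean_test_precision", "mean_test_recall"]
]
print(results.sort_values("mean_test_recall", ascending=False).head())
print("best by recall:", search.best_params_)
```

Because the estimator is passed in once, this layout also makes it harder to tune a different model class than the one the section is about.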

#### Procedure
Training new model with new "best" parameters

In [44]:
# Instantiate optimized model
clf_optimized = tree.DecisionTreeClassifier(max_depth=18,  max_features=15)

# Train the model on training data
clf_optimized.fit(X_train, y_train);

In [45]:
# Train accuracy
y_train_pred = clf_optimized.predict(X_train)
train_acc = clf_optimized.score(X_train, y_train)
train_acc

0.9582542694497154

In [46]:
# Test accuracy
y_pred = clf_optimized.predict(X_test)
test_acc = clf_optimized.score(X_test, y_test)
test_acc

0.7575757575757576

In [47]:
# Accuracy
acc = accuracy_score(y_test, y_pred)
acc

0.7575757575757576

In [48]:
# Precision
prec = precision_score(y_test, y_pred)
prec

0.48484848484848486

In [49]:
# Recall
recall = recall_score(y_test, y_pred)
recall

0.5161290322580645

#### Observations
Precision and recall are nearly equal to each other here, but both took a hit in magnitude with this decision tree. After consulting my team, I decided to keep this model anyway so that our client could compare a model with some variance in its metrics against a model with more stable metrics.
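
To make the three scores concrete, they can be recomputed from confusion-matrix counts. The counts below are an assumption: they are not printed anywhere in the notebook, and were picked because they reproduce the accuracy, precision, and recall reported above if the test set has 132 rows.

```python
# Hypothetical confusion-matrix counts (assumed, not taken from the notebook output)
tp, fp, fn, tn = 16, 17, 15, 84

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are truly positive
recall = tp / (tp + fn)                     # share of actual positives that the model finds
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Under these assumed counts the model raises about as many false alarms (17) as it misses positives (15), which is why precision and recall land so close together.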

## SMOTE on RF
Guide: https://beckernick.github.io/oversampling-modeling/

#### Background
- SMOTE (Synthetic Minority Oversampling Technique) is a statistical technique for increasing the number of cases in a dataset in a balanced way. It generates synthetic ("fake") data points for the minority class (in this case the invalid patents) by interpolating between existing minority samples and their nearest minority-class neighbors, which balances the dataset out
- It was useful for increasing our model evaluation metrics, especially precision and recall during cross-validation
- The purpose was to avoid skewing our classification results toward the majority class by giving our model a training set with a more even proportion of valid and invalid patents
- I used SMOTE rather than random oversampling because I wanted to create more diversity in the training set. I chose it over the NearMiss algorithm (an undersampling method) because I wanted to prevent information loss and preserve the amount of data in our training set (which was already not that much)
- A disadvantage of SMOTE is that it can create additional noise that affects the model's generalizability because it does not consider neighboring examples from other classes
- One issue I had was understanding exactly what SMOTE does and what it means by synthetic data points
- I learned about implementing SMOTE through an article a teammate sent me, linked above
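
Since "synthetic data point" was the confusing part, here is the core step in isolation: pick a minority sample, pick one of its nearest minority-class neighbors, and place a new point somewhere on the line segment between them. This is a minimal NumPy sketch, assuming a toy 2-D minority class and a hypothetical helper `make_synthetic_point`; it is not imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(12)

# Toy minority-class samples with two features (illustrative values only)
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.6]])

def make_synthetic_point(samples, idx, k=2):
    """Interpolate between samples[idx] and one of its k nearest neighbors."""
    base = samples[idx]
    distances = np.linalg.norm(samples - base, axis=1)
    neighbor_ids = np.argsort(distances)[1:k + 1]  # skip position 0, the point itself
    neighbor = samples[rng.choice(neighbor_ids)]
    gap = rng.random()                             # random position along the segment, in [0, 1)
    return base + gap * (neighbor - base)

new_point = make_synthetic_point(minority, idx=0)
print(new_point)
```

The new point always lies between two real minority samples, which is why SMOTE adds more variety than duplicating rows but can also place points in regions that overlap the other class.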

In [12]:
# Instantiate model
smt = SMOTE(random_state=12)

# Oversample the minority class in the training set only; the test set is left untouched
# (fit_resample replaces fit_sample, which was removed in imbalanced-learn 0.8)
X_train_res, y_train_res = smt.fit_resample(X_train, y_train)

In [13]:
# Choose the RF Classifier as your model
rf_smt = RandomForestClassifier(n_estimators=5, random_state=42)

# Fit the model on the SMOTE-resampled training set
rf_smt.fit(X_train_res, y_train_res)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=5,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [14]:
# Train accuracy (on the resampled training set)
y_train_pred = rf_smt.predict(X_train_res)
train_acc = rf_smt.score(X_train_res, y_train_res)
train_acc

0.9730077120822622

In [15]:
# Test accuracy
y_pred = rf_smt.predict(X_test)
test_acc = rf_smt.score(X_test, y_test)
test_acc

0.7803030303030303

#### Observations
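Training accuracy (measured on the resampled training set) is about 0.97 versus about 0.78 on the test set, so there is still a gap and it is important to tune parameters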

#### Tuning depth 
Same procedure as above!

In [30]:
# Same procedure as above
accuracy = []
presision = []
recall = []

# Same procedure as above
depths = np.arange(2, 21)
for depth in depths:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth)
    acc = cross_val_score(rf, X_train_res, y_train_res, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='precision')
    presision.append(np.mean(pre))
    rec = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": presision, "recall" : recall}, index = depths)

max_depth,accuracy,precision,recall
2,0.795856,0.874638,0.699667
3,0.822903,0.902736,0.748585
4,0.84603,0.926388,0.745887
5,0.851166,0.927043,0.789677
6,0.861423,0.925152,0.774259
7,0.862713,0.917757,0.789677
8,0.866576,0.932001,0.78708
9,0.867825,0.926522,0.794805
10,0.866551,0.922171,0.794772
11,0.869115,0.920673,0.802498


#### Observations
We see much higher results for precision and recall, which is fantastic! They both rise and fall slightly around higher depths, but an ideal depth could be 20

#### Tuning max_features
- Same procedure as above!

In [31]:
# Same procedure as above
accuracy = []
presision = []
recall = []

# Same procedure as above
features = [15, 20, 25, 30, 35, 40, 45]
for feat in features:
    rf = RandomForestClassifier(n_estimators=100, max_features=feat)
    acc = cross_val_score(rf, X_train_res, y_train_res, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='precision')
    presision.append(np.mean(pre))
    rec = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": presision, "recall" : recall}, index = features)

max_features,accuracy,precision,recall
15,0.86909,0.92172,0.802498
20,0.86139,0.920326,0.812754
25,0.863937,0.916124,0.807592
30,0.865252,0.915265,0.817849
35,0.86651,0.914608,0.820446
40,0.867808,0.906,0.81019
45,0.866551,0.907478,0.812754


#### Observations
Again, same as the previous cell! The best max_features could be 25

In [32]:
# Same procedure as above
accuracy = []
presision = []
recall = []
scores = []

# Same procedure as above
depths = np.arange(2, 21)
features = [15, 20, 25, 30, 35, 40, 45]
for depth in depths:
    for feat in features:
        clf = RandomForestClassifier(n_estimators=100, max_depth=depth,  max_features=feat)
        acc = cross_val_score(clf, X_train_res, y_train_res, cv=5)
        pre = cross_val_score(clf, X_train_res, y_train_res, cv=5, scoring='precision')
        rec = cross_val_score(clf, X_train_res, y_train_res, cv=5, scoring='recall')
        score = [depth, feat, acc, pre, rec]
        scores.append(score)

In [33]:
u = pd.DataFrame(data = scores, columns = ["Depth", "Features", "Accuracy", "Precision", "Recall"])

In [34]:
# Same procedure as above
i = [bool(sum(i == [0.0, 0.0, 0.0, 0.0, 0.0])) for i in u.Precision.values]
u[i]

index,Depth,Features,Accuracy,Precision,Recall


In [35]:
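# Same procedure as above: drop any flagged rows (copying avoids pandas' SettingWithCopyWarning), then average each metric over the folds
u = u[~np.array(i)].copy()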
u["Accuracy"] = u.Accuracy.apply(np.mean)
u["Precision"] = u.Precision.apply(np.mean)
u["Recall"] = u.Recall.apply(np.mean)
u

index,Depth,Features,Accuracy,Precision,Recall
0,2,15,0.808759,0.876056,0.725341
1,2,20,0.789479,0.872502,0.725408
2,2,25,0.786849,0.859484,0.689311
3,2,30,0.798412,0.878051,0.671362
4,2,35,0.785583,0.863884,0.684149
...,...,...,...,...,...
128,20,25,0.872945,0.920623,0.817882
129,20,30,0.871671,0.914253,0.815318
130,20,35,0.865244,0.907421,0.820446
131,20,40,0.865236,0.917518,0.822977


#### Observations

After a cursory look, the best depth = 20 and the best number of features = 40, and we are seeing much higher precision and recall scores here!

#### Procedure
Training new model with new "best" parameters

In [55]:
# Instantiate optimized model
rf_smt_optimized = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=40)

# Train the model on training data
rf_smt_optimized.fit(X_train_res, y_train_res);

In [58]:
# Predict on the test set with the optimized model, then print all metrics
y_pred = rf_smt_optimized.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
acc, prec, recall, f1

(0.8333333333333334, 0.68, 0.5483870967741935, 0.6071428571428571)

#### Observations
This was a pretty big struggle for me: understanding why the metrics dropped so much when the tuned parameters were put into the random forest with SMOTE and evaluated on the test set. As a team, we discussed the randomness of the train/test split for SMOTE and random forests, but it was challenging to figure out why the drop was so large compared with what we saw in the dataframes previously.

## SMOTE on DT

#### Background
- Same notes as above!

### NOTE
I started with this model and then moved on to the RF with SMOTE afterwards!

In [77]:
# Instantiating model
smt = SMOTE(random_state=12)

# Oversampling the training set with SMOTE (the test set is left untouched)
X_train_res, y_train_res = smt.fit_resample(X_train, y_train)

In [78]:
# Using a Decision Tree now
dt_smt = tree.DecisionTreeClassifier()

# Fitting the tree on the SMOTE-resampled data
dt_smt.fit(X_train_res, y_train_res)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [79]:
# Train accuracy (on the resampled training set)
y_train_pred = dt_smt.predict(X_train_res)
train_acc = dt_smt.score(X_train_res, y_train_res)
train_acc

0.987146529562982

In [80]:
# Test accuracy
y_pred = dt_smt.predict(X_test)
test_acc = dt_smt.score(X_test, y_test)
test_acc

0.7424242424242424

#### Observations
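Same as above: training accuracy on the resampled data (about 0.99) is well above test accuracy (about 0.74), so tuning is still worthwhile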

#### Tuning depths
Same procedure as above

In [81]:
# Same procedure as above
accuracy = []
presision = []
recall = []

# Same procedure as above
depths = np.arange(2, 21)
for depth in depths:
    rf = tree.DecisionTreeClassifier()
    acc = cross_val_score(rf, X_train_res, y_train_res, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='precision')
    presision.append(np.mean(pre))
    rec = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": presision, "recall" : recall}, index = depths)

max_depth,accuracy,precision,recall
2,0.78684,0.798464,0.786946
3,0.794541,0.796203,0.787013
4,0.793259,0.788485,0.797269
5,0.790703,0.796241,0.784482
6,0.786849,0.792113,0.792174
7,0.790695,0.796043,0.774226
8,0.79584,0.808327,0.781885
9,0.791977,0.802628,0.797269
10,0.792002,0.79123,0.779321
11,0.786857,0.795626,0.789577


#### Observations
Again, the metrics are much higher here. We could choose a depth of 20, but that could lead to high variance, which we want to avoid, so an ideal depth could be 5

#### Tuning max_features
Same procedure as above

In [82]:
# Same procedure as above
accuracy = []
presision = []
recall = []

# Same procedure as above
features = [15, 20, 25, 30, 35, 40, 45]
for feat in features:
    rf = tree.DecisionTreeClassifier()
    acc = cross_val_score(rf, X_train_res, y_train_res, cv=5)
    accuracy.append(np.mean(acc))
    pre = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='precision')
    presision.append(np.mean(pre))
    rec = cross_val_score(rf, X_train_res, y_train_res, cv=5, scoring='recall')
    recall.append(np.mean(rec))

pd.DataFrame({"accuracy" : accuracy, "precision": presision, "recall" : recall}, index = features)

max_features,accuracy,precision,recall
15,0.782986,0.790859,0.781885
20,0.795782,0.791982,0.799867
25,0.797122,0.790894,0.797269
30,0.793284,0.797376,0.789577
35,0.79713,0.790913,0.781885
40,0.790695,0.785432,0.794739
45,0.795823,0.797055,0.781885


#### Observations
Same as above! An ideal number of features could be 15, again choosing the smallest number to limit variance. We can also do this because the metrics are roughly uniform across the various feature counts, with minimal differences

In [83]:
# Same procedure as above
accuracy = []
presision = []
recall = []
scores = []

# Same procedure as above
depths = np.arange(2, 21)
features = [15, 20, 25, 30, 35, 40, 45]
for depth in depths:
    for feat in features:
        clf = tree.DecisionTreeClassifier(max_depth=depth,  max_features=feat)
        acc = cross_val_score(clf, X_train_res, y_train_res, cv=5)
        pre = cross_val_score(clf, X_train_res, y_train_res, cv=5, scoring='precision')
        rec = cross_val_score(clf, X_train_res, y_train_res, cv=5, scoring='recall')
        score = [depth, feat, acc, pre, rec]
        scores.append(score)

In [84]:
# Same procedure as above
v = pd.DataFrame(data = scores, columns = ["Depth", "Features", "Accuracy", "Precision", "Recall"])

In [85]:
i = [bool(sum(i == [0.0, 0.0, 0.0, 0.0, 0.0])) for i in v.Precision.values]
v[i]

index,Depth,Features,Accuracy,Precision,Recall


In [86]:
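# Same procedure as above: drop any flagged rows (copying avoids pandas' SettingWithCopyWarning), then average each metric over the folds
v = v[~np.array(i)].copy()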
v["Accuracy"] = v.Accuracy.apply(np.mean)
v["Precision"] = v.Precision.apply(np.mean)
v["Recall"] = v.Recall.apply(np.mean)
v

index,Depth,Features,Accuracy,Precision,Recall
0,2,15,0.609396,0.726193,0.508425
1,2,20,0.611861,0.792191,0.524975
2,2,25,0.640265,0.696057,0.496337
3,2,30,0.626104,0.745926,0.385847
4,2,35,0.628677,0.634414,0.473260
...,...,...,...,...,...
128,20,25,0.793226,0.831339,0.787013
129,20,30,0.783027,0.775139,0.766434
130,20,35,0.802291,0.810400,0.776690
131,20,40,0.791927,0.784181,0.797169


#### Observations
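The metrics, once again, are much higher with SMOTE for precision and recall, which are the ones we emphasized in this project. After a cursory look, the best depth = 20 and the best number of features = 40. One question that did come up during this project was why the scores increase as the parameter values increase here when they were roughly uniform in the previous dataframes (when we tuned parameters individually). A likely explanation is that the two individual tuning loops above construct tree.DecisionTreeClassifier() without passing max_depth or max_features, so every row in those tables scores the same default tree; only this grid search actually varies the parameters.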

#### Procedure 
Training new model with new "best" parameters

In [87]:
# Instantiate optimized model
dt_smt_optimized = tree.DecisionTreeClassifier(max_depth=20, max_features=40)

# Train the model on training data
dt_smt_optimized.fit(X_train_res, y_train_res);

In [91]:
# Predict on the test set with the optimized model, then print all metric scores
y_pred = dt_smt_optimized.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
acc, prec, recall, f1

(0.7575757575757576,
 0.48717948717948717,
 0.6129032258064516,
 0.5428571428571428)

#### Observations
As previously seen, the metrics have taken a hit. 

#### Overall
Each model (DT, RF, DT with SMOTE, RF with SMOTE) offers a different tradeoff among its metrics. It seemed to me that it would ultimately fall to our client to decide which model best fit their needs

## Functions

#### Procedure
Here, I created 4 pipelines, one for each model I built and tuned, and gave our client the option to choose which model they wanted based on the metrics they preferred to see.

In [51]:
def predict(clf, depth, num_features, patent_num, data):
    """ function that returns the prediction for the eligibility of a patent under 101 given a patent number
        1. clf: RandomForestClassifier or tree.DecisionTreeClassifier
        2. depth: any desired depth for the classifier you'd like
        3. num_features: any desired number of features for the classifier you'd like
        4. patent_num: chosen patent number or application number
        5. data: dataset housing patent and feature info """
    
    # Instantiate the classifier model
    clf = clf(max_depth=depth, max_features=num_features)
    
    # Train the model on the training data
    clf.fit(X_train, y_train);
    
    # Get the array of predictions made by the model 
    y_pred = clf.predict(X_test)
    
    # Find the correct index of prediction in y_pred based on the given patent number
    index = data.loc[data['Patent/Application'] == patent_num].index[0]
    
    # Print the classification, accuracy, precision, recall, and f1 scores for transparency
    print('Classification: {}'.format(y_pred[index]))
    print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
    print('Precision: {}'.format(precision_score(y_test, y_pred)))
    print('Recall: {}'.format(recall_score(y_test, y_pred)))
    print('F1: {}'.format(f1_score(y_test, y_pred)))

In [4]:
from sklearn.ensemble import RandomForestClassifier  # needed for the default argument below

def predict_rf_optimized(patent_num, data, clf=RandomForestClassifier, depth=19, features=40):   
    """ function that returns the prediction for the eligibility of a patent under 101 given a patent number 
    with optimized features for a Random Forest Classifier
        1. patent_num: chosen patent number or application number
        2. data: dataset housing patent and feature info
        3. clf: RandomForestClassifier
        4. depth: default set to 19 based on investigation of best results for precision and recall
        5. features: default set to 40 based on investigation of best results for precision and recall """
    
    # Instantiate the classifier model
    clf = clf(n_estimators=100, random_state=42, max_depth=depth, max_features=features)
    
    # Train the model on the training data
    clf.fit(X_train, y_train);
    
    # Get the array of predictions made by the model 
    y_pred = clf.predict(X_test)
    
    # Find the correct index of prediction in y_pred based on the given patent number
    index = data.loc[data['Patent/Application'] == patent_num].index[0]
    
    # Print the classification, accuracy, precision, recall, and f1 scores for transparency
    print('Classification: {}'.format(y_pred[index]))
    print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
    print('Precision: {}'.format(precision_score(y_test, y_pred)))
    print('Recall: {}'.format(recall_score(y_test, y_pred)))
    print('F1: {}'.format(f1_score(y_test, y_pred)))


In [99]:
def predict_dt_optimized(patent_num, data, clf=tree.DecisionTreeClassifier, depth=18, features=15):
    """ function that returns the prediction for the eligibility of a patent under 101 given a patent number
    with optimized features for a Shallow Decision Tree
        1. patent_num: chosen patent number or application number
        2. data: dataset housing patent and feature info
        3. clf: tree.DecisionTreeClassifier
        4. depth: default set to 18 based on investigation of best results for precision and recall
        5. features: default set to 15 based on investigation of best results for precision and recall """
    
    # Instantiate the classifier model
    clf = clf(max_depth=depth, max_features=features)
    
    # Train the model on the training data
    clf.fit(X_train, y_train);
    
    # Get the array of predictions made by the model 
    y_pred = clf.predict(X_test)
    
    # Find the correct index of prediction in y_pred based on the given patent number
    index = data.loc[data['Patent/Application'] == patent_num].index[0]
    
    # Print classification, accuracy, precision, recall, and f1 scores
    print('Classification: {}'.format(y_pred[index]))
    print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
    print('Precision: {}'.format(precision_score(y_test, y_pred)))
    print('Recall: {}'.format(recall_score(y_test, y_pred)))
    print('F1: {}'.format(f1_score(y_test, y_pred)))

In [100]:
def predict_smt_optimized(clf, patent_num, data, depth=20, features=40):
    """ function that returns the prediction for the eligibility of a patent under 101 given a patent number
    with optimized features for a Random Forest or Shallow Decision Tree optimized using SMOTE
        1. clf = RandomForestClassifier or tree.DecisionTreeClassifier
        2. patent_num: chosen patent number or application number
        3. data: dataset housing patent and feature info
        4. depth: default set to 20 based on investigation of best results for precision and recall
        5. features: default set to 40 based on investigation of best results for precision and recall """
    
    # Instantiate SMOTE
    smt = SMOTE(random_state=12)
    
    # Oversample the training set with SMOTE (fit_resample replaces the removed fit_sample)
    X_train_res, y_train_res = smt.fit_resample(X_train, y_train)
    
    # Instantiate the classifier model
    clf = clf(max_depth=depth, max_features=features)
    
    # Train the model on the training data
    clf.fit(X_train_res, y_train_res);
    
    # Get the array of predictions made by the model 
    y_pred = clf.predict(X_test)
    
    # Find the correct index of prediction in y_pred based on the given patent number
    index = data.loc[data['Patent/Application'] == patent_num].index[0]
    
    # Print classification, accuracy, precision, recall, and f1 scores
    print('Classification: {}'.format(y_pred[index]))
    print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
    print('Precision: {}'.format(precision_score(y_test, y_pred)))
    print('Recall: {}'.format(recall_score(y_test, y_pred)))
    print('F1: {}'.format(f1_score(y_test, y_pred)))
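
The functions above take a row label from `data` and use it to index `y_pred`, which only lines up if `data` and `X_test` have the same rows in the same order. A safer pattern is to pull the patent's own feature row and predict on it directly. This is a minimal sketch under assumptions: a toy frame with hypothetical feature columns `f1`, `f2` and label column `valid` (only 'Patent/Application' is a column name that appears in this notebook), and a hypothetical helper `predict_patent`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real dataset; column names other than
# 'Patent/Application' are hypothetical
toy = pd.DataFrame({
    "Patent/Application": [101, 102, 103, 104, 105, 106],
    "f1": [0.1, 0.9, 0.2, 0.8, 0.15, 0.85],
    "f2": [1.0, 0.0, 0.9, 0.1, 0.95, 0.05],
    "valid": [0, 1, 0, 1, 0, 1],
})
feature_cols = ["f1", "f2"]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(toy[feature_cols], toy["valid"])

def predict_patent(model, data, patent_num, feature_cols):
    """Return the model's prediction for the row matching patent_num."""
    row = data.loc[data["Patent/Application"] == patent_num, feature_cols]
    if row.empty:
        raise KeyError(f"patent {patent_num} not found")
    return model.predict(row)[0]

print(predict_patent(model, toy, 104, feature_cols))
```

The toy model is trained and queried on the same rows purely for brevity; with the real data, the model would be fit on the training split as in the functions above.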

#### Procedure
Example usages of each function above

In [54]:
predict(RandomForestClassifier, 18, 19, 8050982, final)

Classification: 0.0
Accuracy: 0.8409090909090909
Precision: 0.75
Recall: 0.4838709677419355
F1: 0.5882352941176471


In [55]:
predict_rf_optimized(8050982, final)

Classification: 0.0
Accuracy: 0.8409090909090909
Precision: 0.75
Recall: 0.4838709677419355
F1: 0.5882352941176471


In [56]:
predict_dt_optimized(8050982, final)

Classification: 0.0
Accuracy: 0.8181818181818182
Precision: 0.6129032258064516
Recall: 0.6129032258064516
F1: 0.6129032258064516


In [97]:
predict_smt_optimized(RandomForestClassifier, 8050982, final)

Classification: 0.0
Accuracy: 0.8333333333333334
Precision: 0.6956521739130435
Recall: 0.5161290322580645
F1: 0.5925925925925926


In [98]:
predict_smt_optimized(tree.DecisionTreeClassifier, 8050982, final)

Classification: 0.0
Accuracy: 0.7575757575757576
Precision: 0.4864864864864865
Recall: 0.5806451612903226
F1: 0.5294117647058824


# Improvements or Next Steps
- Try out TensorFlow or Keras
- More hyperparameter tuning
- Stratified k-fold cross-validation, because we had an imbalanced dataset
- Investigate the important features and their values for each model, and train on just those
- Try KNN and SVM as other models, in the spirit of the No Free Lunch theorem
- Create functions for the hyperparameter tuning steps!
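
The last bullet could look something like the helper below: it cross-validates one hyperparameter over a list of values and returns the mean scores as a DataFrame, using `cross_validate` so each candidate is fit once per fold rather than once per metric. This is a sketch, with the hypothetical name `tune_parameter` and synthetic data standing in for `X_train` and `y_train`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

METRICS = ["accuracy", "precision", "recall"]

def tune_parameter(model_cls, param_name, values, X, y, n_splits=5, **fixed_params):
    """Cross-validate model_cls for each value of param_name; return mean scores per value."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    rows = {}
    for value in values:
        # The tuned value is passed to the constructor on every iteration
        model = model_cls(**{param_name: value}, **fixed_params)
        scores = cross_validate(model, X, y, cv=cv, scoring=METRICS)
        rows[value] = {m: scores[f"test_{m}"].mean() for m in METRICS}
    return pd.DataFrame.from_dict(rows, orient="index").rename_axis(param_name)

# Synthetic, imbalanced stand-in data (NOT the patent dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6,
    weights=[0.75, 0.25], random_state=0,
)

depth_scores = tune_parameter(
    DecisionTreeClassifier, "max_depth", [2, 4, 8], X_demo, y_demo, random_state=0
)
print(depth_scores)
```

For classifiers, passing an integer `cv` to scikit-learn already produces stratified folds; the explicit `StratifiedKFold` here mainly adds shuffling and a fixed seed.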