 ### Student Name
 
Please add your name to this markdown cell.

## Submitting your homework!

Three steps are required for this week's homework once you have completed it:

- __Share__ this notebook with _libraryjuicepresspython@gmail.com_ so it can be marked in Colab
- __Print as PDF__ and upload to the week's homework submission location on the Library Juice Academy LMS
- __Post__ to the Week 4 Forum

# Python for Librarians - Week 4 Homework

## Machine Learning Madness

We are going to take our new knowledge of ML and apply it to a different dataset. This week we are going to look at a concatenated version of this [dataset](https://datadryad.org/stash/dataset/doi:10.5061/dryad.2h4j5). 

Let's see if you can infer what our model features and target will be once we look at the columns.

In [None]:
import pandas as pd

#We'll draw a graph later on
import matplotlib.pyplot as plt
import numpy as np

#Our 'Machine Learning pieces'
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text
from sklearn import metrics 
from sklearn import tree


#Suppress the distracting warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
#Load up our data
citation_data = pd.read_csv("https://raw.githubusercontent.com/elibtronic/lja_datasets/master/week_4_homework_citation.csv")

In [None]:
#Run this cell a few times to get a look at the data
citation_data.sample(10)

In [None]:
#Run this cell to generate some summary statistics on the data
citation_data.describe()

Our data is ~1300 lines from the original dataset with the following columns:

- Score1 - a score assigned by assessor 1
- Score2 - a score assigned by assessor 2
- IF2 - the two year impact factor score 
- IF5 - the five year impact factor score
- TopCitation - If the citation is among the top 10% of all citations in this dataset

We are going to build a **Decision Tree Classifier** ML model. We'll use that model to see if we can predict if a citation from the dataset will be in the top 10% of all cited articles.

This may seem like an esoteric question to ask and answer, and I'll give you that. We are trying to see if 4 different _scores_ can be used to determine which class a citation belongs in:

- The top 10% (those worth a closer look)
- The bottom 90% (those that we don't have to look at any closer)

We are also going to see how accurate we can get a model by tweaking some parameters.
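
When reading the accuracy numbers later on, it may help to have a baseline in mind. If roughly 10% of the rows are in the top class, a "model" that always answers "not top" would already be right about 90% of the time. The sketch below illustrates this with made-up labels, not the homework data. The names `X_demo`, `y_demo` and `baseline` are just for illustration, and `DummyClassifier` is a scikit-learn helper that is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Made-up labels (an assumption): 900 "not top" (0) and 100 "top" (1),
# mimicking a 90/10 split like the one described above
y_demo = np.array([0] * 900 + [1] * 100)

# The features do not matter for this baseline, so a column of zeros will do
X_demo = np.zeros((len(y_demo), 1))

# Always predict whichever class is most common in the training labels
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_accuracy = metrics.accuracy_score(y_demo, baseline_pred)
print(baseline_accuracy)
```

An accuracy close to this baseline suggests a model has not learned much beyond "most citations are not in the top 10%".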

### Uncommenting

Up until this point we have been using comments to leave notes to the people that read the code. We can also use comments to block out different lines of code we don't want Python to run. The following cell has an example. Remove the correct comment symbol `#` so that the code executes. We'll use uncommenting in _Q1_ to select our features and target.

In [None]:
days_in_a_year = 365
#print(days_in_a_year)

## Q1

In the next cell uncomment the correct lines to identify the features and the targets that this model will be built with. 

Uncomment one line between lines 3 and 10, and one line between lines 14 and 18. Your answer for _Q1_ will be the uncommented lines of code.

In [None]:
#Which set of columns will be our features?

#citation_features = ["Score1"]
#citation_features = ["Score1","Score2","IF2","IF5","TopCitation"]
#citation_features = ["Score1","Score2"]
#citation_features = ["Score1","Score2","IF2","IF5"]
#citation_features = ["Score1","Score2","IF2"]
#citation_features = ["TopCitation"]
#citation_features = ["TopCitation","Score1"]
#citation_features = ["TopCitation","IF2","IF5"]

#Which column will be our target?

#citation_target = ["Score1"]
#citation_target = ["Score2"]
#citation_target = ["IF2"]
#citation_target = ["IF5"]
#citation_target = ["TopCitation"]

X = citation_data[citation_features]
y = citation_data[citation_target]

Run the next cell to build the Decision Tree Classifier model and get the accuracy of the model using our basic set of parameters.

In [None]:
#We'll start with 20 just for fun
test_percent = 20

X_train, X_test, y_train, y_test = train_test_split(X, \
                                                    y, \
                                                    test_size=test_percent/100.0)
# Create Decision Tree classifier object
treeClass = DecisionTreeClassifier()

# Train
treeClass = treeClass.fit(X_train,y_train)

#Predict
y_pred = treeClass.predict(X_test)

#Accuracy?
print("Accuracy of our model: ")
metrics.accuracy_score(y_test,y_pred)

## Visualizing our tree

Let's have a look at the tree we created without changing any hyperparameters.

In [None]:
printed_tree = export_text(treeClass,feature_names=citation_features)
print(printed_tree)

## Q2

Let's see what effect changing the testing percent has on accuracy. In the cell below add some values to the `testing_percents` list to test this and view the corresponding graph. You need to modify line 3.

Your graph will be the answer to _Q2_.

In [None]:
#add some values between 1 - 99 in a comma separated list in the next line

testing_percents = [,,,,]

accuracy = []
training_percents = []

for test_ratio in sorted(testing_percents):
    X_train, X_test, y_train, y_test = train_test_split(X, \
                                                        y, \
                                                        test_size=test_ratio/100.0)
    treeClassTest = DecisionTreeClassifier()
    treeClassTest = treeClassTest.fit(X_train,y_train)
    y_pred = treeClassTest.predict(X_test)
    score = metrics.accuracy_score(y_test,y_pred)
    accuracy.append(score)
    training_percents.append(100 - test_ratio)

    
plt.plot(training_percents,accuracy)
plt.ylabel("Accuracy")
plt.xlabel("Training Size %")
plt.show()

## Q3

Let's see what effect changing the maximum depth has on accuracy. In the cell below add some values to the `max_options` list to test this and view the corresponding graph. You need to modify line 6.

Your graph will be the answer to _Q3_.

In [None]:
#We'll fix this at 20% for this investigation
test_percent = 20

#add some values between 1 - 30 in a comma separated list in the next line

max_options = [,,,,]

accuracy = []
tree_max = []

for max_d in sorted(max_options):
    X_train, X_test, y_train, y_test = train_test_split(X, \
                                                        y, \
                                                        test_size=test_percent/100.0)
    
    #We set maximum depth in the DecisionTreeClassifier when we first create the variable
    treeClassTest = DecisionTreeClassifier(max_depth=max_d)
    treeClassTest = treeClassTest.fit(X_train,y_train)
    y_pred = treeClassTest.predict(X_test)
    score = metrics.accuracy_score(y_test,y_pred)
    accuracy.append(score)
    tree_max.append(max_d)

    
#Plot against tree_max, which holds the depths in the same (sorted) order as the scores
plt.plot(tree_max,accuracy)
plt.ylabel("Accuracy")
plt.xlabel("Maximum Depth of Tree")
plt.show()

## Maximizing and minimizing our accuracy

Use the cell below to answer _Q5_. You can modify the values on lines 2 and 3.

In [None]:

test_percent = 
max_d = 

X_train, X_test, y_train, y_test = train_test_split(X, \
                                                    y, \
                                                    test_size=test_percent/100.0)
    
treeClassTest = DecisionTreeClassifier(max_depth=max_d)
treeClassTest = treeClassTest.fit(X_train,y_train)
y_pred = treeClassTest.predict(X_test)

#Accuracy?
print("Calculated accuracy: ")
print(metrics.accuracy_score(y_test,y_pred))

print("\nTree generated")
printed_tree = export_text(treeClassTest,feature_names=citation_features)
print(printed_tree)


## Q5

What combination of parameters above led to the highest accuracy? Don't worry if you can't get a perfect tree. The goal of this question is to experiment with the 2 parameters to see if you can modify both at once to get a better score.

I got the highest accuracy by setting

- Testing percentage to:
- Maximum depth of the tree to:
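
A note on comparing settings: `train_test_split` shuffles the rows at random, so running the same cell twice with the same values will usually give slightly different accuracies. One possible way to make comparisons steadier is to repeat the split several times with fixed seeds and average the scores. The sketch below does this on made-up data from `make_classification`, not the citation dataset. The function `mean_accuracy` and the names `X_demo` and `y_demo` are hypothetical helpers for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Made-up data (an assumption): 500 rows, 4 features, roughly a 90/10 class split
X_demo, y_demo = make_classification(n_samples=500, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     weights=[0.9, 0.1], random_state=0)

def mean_accuracy(test_percent, max_d, repeats=10):
    """Average test accuracy over several seeded train/test splits."""
    scores = []
    for seed in range(repeats):
        # A fixed random_state makes each split repeatable
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_percent / 100.0, random_state=seed)
        model = DecisionTreeClassifier(max_depth=max_d, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(metrics.accuracy_score(y_te, model.predict(X_te)))
    return np.mean(scores)

print(mean_accuracy(20, 3))
```

Because every split and tree is seeded, calling `mean_accuracy(20, 3)` again returns the same number, which makes it easier to tell whether a change in parameters actually helped.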

## From Trees to Forests

We've explored the accuracy of our ML model when we just created one tree. Let's see if we can increase this with a forest of trees for our citation information. Try some different values for the two parameters on line 2 and line 4 to answer _Q6_.

In [None]:
#Pick a value between 1 and 99
test_percent = 
#Pick a value between 10 and 50
number_estimators = 

X_train, X_test, y_train, y_test = train_test_split(X, \
                                                    y, \
                                                    test_size=test_percent/100.0)

#Create Random Forest Classifier
clf = RandomForestClassifier(n_estimators=number_estimators)

#Train
clf.fit(X_train,np.ravel(y_train))

#Predict on the held-out test data (not the data the forest was trained on)
y_pred = clf.predict(X_test)

#Accuracy
print("Accuracy?")
print(metrics.accuracy_score(y_test,y_pred))


#Visualize the first tree in this forest
print("\nFirst tree in this forest")
printed_tree = export_text(clf.estimators_[0],feature_names = citation_features)
print(printed_tree)

## Q6

What combination of parameters above led to the highest accuracy? Don't worry if you can't get a perfect forest/tree. The goal of this question is to experiment with the 2 parameters to see if you can modify both at once to get a better score.


I got the highest accuracy by setting

- Testing percentage to:
- Number of estimators to:
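
If you are curious how a forest turns many trees into one answer: scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by its individual trees and picks the class with the highest average. The sketch below checks this on made-up data from `make_classification`, not the citation dataset. The names `X_demo`, `y_demo` and `forest` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Made-up data (an assumption): 200 rows, 4 features, 2 classes
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each entry in estimators_ is one fitted decision tree
print(len(forest.estimators_))

# Average the per-tree class probabilities by hand...
per_tree = [t.predict_proba(X_demo) for t in forest.estimators_]
averaged = np.mean(per_tree, axis=0)

# ...and compare with what the forest itself reports
matches = np.allclose(averaged, forest.predict_proba(X_demo))
print(matches)
```

This is also why `clf.estimators_[0]` in the cell above shows only one of the trees that contribute to the forest's prediction.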

## Q7


Which model (**decision tree** or **forest**) was easier to maximize accuracy for?

The model that was easiest to maximize was...

## Q8

Can you take a guess as to why your answer to _Q7_ was the way it was?

I think it was because...

# Congratulations!

You've now officially completed Python for Librarians! Be sure to complete the last three steps:

- Save your notebook as PDF and upload
- Share your completed notebook with `libraryjuicepresspython@gmail.com` so I can get a shared copy of your work.
- Head over to the Week 4 Homework Forum to make your last post

-----

# Thanks! I hope you enjoyed this

Thanks for taking this class and giving Python a try. I have put together all of the datasets I've used for this class in a [GitHub repository](https://github.com/elibtronic/lja_datasets). Feel free to open up a fresh Google Colab notebook and load a CSV file to perform some analysis or to build some machine learning models.

Please drop me a [line](https://twitter.com/elibtronic/) if you'd like to talk more about these topics or if you have a notebook that you'd like to share.