# Practice Notebook: New Models: Random Forest

## Question

Train a random forest model. The test set accuracy should be at least 0.88.

**Hint**

Try n_estimators values from 1 to 10. Pick the option with the best quality for the validation set.



In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression



In [None]:
#import our data and preview the data
diabetes_df = pd.read_csv('diabetes2.csv')
diabetes_df.head()
diabetes_df.sample(n=10)
diabetes_df.shape

(768, 9)

In [None]:
from pandas._libs.hashtable import duplicated
diabetes_df.isnull().sum() #no missing value
diabetes_df.duplicated().any() #no duplicated observation
diabetes_df.dtypes #all columns are numeric data type
diabetes_df.nunique() #check uniques values per columns. column 'Outcome' is classification label
diabetes_df.columns #columns start with capital


Index(['pregnancies', 'glucose', 'bloodpressure', 'skinthickness', 'insulin',
       'bmi', 'diabetespedigreefunction', 'age', 'outcome'],
      dtype='object')

In [None]:

#standardise the columns name
diabetes_df.columns = diabetes_df.columns.str.lower().str.strip()
diabetes_df.head()


In [None]:
#data modeling
#since there is no separate test dataset create train and validation dataset
#use train and test split function. 
#when using the valid_df no model achieved accuracy of >=0.85
train_df, valid_df = train_test_split(diabetes_df, test_size=1, random_state=1234)
print(train_df.shape)
print(valid_df.shape)

(767, 9)
(1, 9)


In [None]:
#create features and target for both train and test
features_train = train_df.drop(columns=['outcome'])
target_train = train_df['outcome']
features_valid = valid_df.drop(columns=['outcome'])
target_valid = valid_df['outcome']

#create a model for Decision Trees, Random Forest and Logistic Regression
#model for Decision Trees, declare and find the ideal depth for the tree
for d in range(1, 11, 1):
  tree_model = DecisionTreeClassifier(random_state=1234, max_depth=d)
  tree_model.fit(features_train, target_train)  #train the model
  #check for accuracy
  print(f'Decision tree has accuracy of: {tree_model.score(features_train, target_train)} for depth of: {d}')

#declare model for random forest and find the best n_estimator value
for n in range(1,20,1):
  forest_model = RandomForestClassifier(random_state=1234, n_estimators=n)
  forest_model.fit(features_train, target_train)
  print(f'Random forest has accuracy of: {forest_model.score(features_train, target_train)} for n={n}')

#declare a model for logistic regression
log_model = LogisticRegression(random_state=1234, solver='liblinear')
log_model.fit(features_train, target_train)
print(f'logistic regression has accuracy of: {log_model.score(features_train, target_train)}')




Decision tree has accuracy of: 0.7353324641460235 for depth of: 1
Decision tree has accuracy of: 0.771838331160365 for depth of: 2
Decision tree has accuracy of: 0.7757496740547588 for depth of: 3
Decision tree has accuracy of: 0.7913950456323338 for depth of: 4
Decision tree has accuracy of: 0.8370273794002607 for depth of: 5
Decision tree has accuracy of: 0.8513689700130378 for depth of: 6
Decision tree has accuracy of: 0.8917861799217731 for depth of: 7
Decision tree has accuracy of: 0.9282920469361148 for depth of: 8
Decision tree has accuracy of: 0.9569752281616688 for depth of: 9
Decision tree has accuracy of: 0.970013037809648 for depth of: 10
Random forest has accuracy of: 0.8917861799217731 for n=1
Random forest has accuracy of: 0.9074315514993481 for n=2
Random forest has accuracy of: 0.9595827900912647 for n=3
Random forest has accuracy of: 0.9556714471968709 for n=4
Random forest has accuracy of: 0.9739243807040417 for n=5
Random forest has accuracy of: 0.9687092568448501 f

###Finding and Recommendation
####Finding
*   Out of the 3 models used Decision Trees, Random Forest and Logistic Regression only Decision Trees, Random Forest meets the criteria of more than 0.85 prediction accuracy.
*    Decision Trees gives a prediction accuracy of >0.85 for tree depth of >=6
*    Random Forest gives a prediction of >0.85 for n_estimator >=1

####Recommendation
Random Forest has the best accuracy but slow while Decision Tree has low accuracy but fast. the most optimal model would be Decision Trees tree depth of ==6 because it meets the criteria and fast




