## 1. Using Pandas to Prepare Data
Import pandas, load the CSV into a dataframe, and clean it up.

Steps:
1. Load the CSV into a dataframe
2. Drop the rows with missing values
3. Inspect the data with describe(), head(), and columns
4. Split data into X, y
5. Split data into training and testing

In [2]:
import pandas as pd # import pandas

In [3]:
data = pd.read_csv('pima-indians-diabetes.csv') # convert csv to dataframe so we can work with it

In [4]:
data = data.dropna(axis=0) # remove empty data

### Check over the data
- describe()
- head()
- columns

In [5]:
data.describe()

| | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| count | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 |
| mean | 3.842243 | 120.859192 | 69.101695 | 20.517601 | 79.90352 | 31.990482 | 0.471674 | 33.219035 | 0.34811 |
| std | 3.370877 | 31.978468 | 19.368155 | 15.954059 | 115.283105 | 7.889091 | 0.331497 | 11.752296 | 0.476682 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.2435 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 32.0 | 32.0 | 0.371 | 29.0 | 0.0 |
| 75% | 6.0 | 140.0 | 80.0 | 32.0 | 127.5 | 36.6 | 0.625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [10]:
data.head() # first 5 rows

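| | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |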


In [13]:
data.columns # equivalent to data.keys()

Index(['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'], dtype='object')
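
The column labels above ('6', '148', '72', ...) look like the values of a patient record rather than names, which suggests the CSV has no header row and pandas used its first line as the header. The following is a minimal sketch of reading such a file with header=None. The column names are an assumption (the file itself carries none), and the inline sample stands in for the real file.

```python
import io
import pandas as pd

# Three rows in the same layout as the file; io.StringIO stands in for the CSV on disk.
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# Assumed column names, chosen for illustration only.
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

# header=None tells pandas that the first line is data, not column labels.
sample = pd.read_csv(sample_csv, header=None, names=column_names)
print(sample.shape)
```

With named columns, the target could then be selected as sample["Outcome"] instead of a column called '1'.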

### Split into X and y
X is the input data
y is the output data - the target of our predictions

If the data is not all numerical, you need to fix it here. For example, if the target is one of three Iris species, you need to write some code that assigns the numbers 0, 1, and 2 to the three species.
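
One way to do that conversion is an explicit mapping from class name to number. This is only a sketch: the toy dataframe, the column name species, and the chosen codes are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the column name "species" and its values are illustrative.
iris = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# Explicit mapping from each class name to an integer code.
species_to_code = {"setosa": 0, "versicolor": 1, "virginica": 2}
iris["species_code"] = iris["species"].map(species_to_code)

print(iris)
```

An explicit dictionary keeps the meaning of each number visible, at the cost of having to list every class by hand.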

In [14]:
# usually you'd select a single target column with data.ColumnName
# but this column is named '1', which is not a valid attribute name, so I'm selecting it with a list of column names

y_columns = ['1']
y = data[y_columns]

In [15]:
x_columns = ['6', '148', '72', '35', '0', '33.6', '0.627', '50'] # copied from data.columns, everything but the target column
X = data[x_columns]

### Train test split

In [16]:
from sklearn.model_selection import train_test_split

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=256)
# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.2, random_state=256)
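
When no sizes are passed, train_test_split holds out 25% of the rows for testing. The sketch below checks that on placeholder arrays with the same number of rows as the dataframe above (767), and also shows the stratify option, which keeps the class balance similar in both parts. The arrays are made up; only the row count comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataframe above (767).
X_demo = np.arange(767 * 2).reshape(767, 2)
y_demo = np.arange(767) % 2

# No sizes given: scikit-learn holds out 25% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=256)
print(len(X_tr), len(X_te))

# stratify=y keeps the share of each class about the same in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=256
)
print(y_tr_s.mean(), y_te_s.mean())
```

Note that the commented-out variant above (train_size=0.7, test_size=0.2) would leave 10% of the rows in neither part.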

## 2. Building the Model
Now that the data preparation is done, we can build the decision tree.


In [49]:
from sklearn.tree import DecisionTreeRegressor
# Create the model
model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=256) # max_leaf_nodes is optional; leave it out to grow the tree without that limit

In [50]:
# Fit the model to the training data
model.fit(X_train, y_train)

DecisionTreeRegressor(max_leaf_nodes=100, random_state=256)

### Predictions
Find out what the model predicts for the testing input data.

In [51]:
predictions = model.predict(X_test)
predictions[:5] # show the first five predictions

array([1.        , 1.        , 1.        , 1.        , 0.00877193])
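
A value such as 0.00877193 shows up because a regression tree predicts the average target of the training rows that land in a leaf. Since the target here is 0 or 1, a classification tree is a natural alternative. The sketch below swaps in DecisionTreeClassifier, which is not what this notebook uses, and runs it on synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data: 8 numeric features, binary target.
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=256)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=256)

clf = DecisionTreeClassifier(max_leaf_nodes=100, random_state=256)
clf.fit(X_tr, y_tr)

class_predictions = clf.predict(X_te)          # class labels, always 0 or 1
class_probabilities = clf.predict_proba(X_te)  # one column per class
print(class_predictions[:5])
```

With a classifier, no rounding step is needed before computing accuracy.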

## 3. Calculating Accuracy
Import mean_absolute_error to measure how far off the predictions are on average.

In [52]:
from sklearn.metrics import mean_absolute_error

In [55]:
# The predictions are not all exactly 0 or 1, so convert them to integers
# (note: int() truncates toward zero, so 0.87 becomes 0 instead of rounding to 1)
rounded_predictions = []
for p in predictions:
    rounded_predictions.append(int(p))

In [56]:
rounded_predictions[:5]

[1, 1, 1, 1, 0]
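
The difference between truncating and rounding matters for values between 0.5 and 1. A small sketch with made-up values in that range, where the 0.5 cut-off is an assumption:

```python
import numpy as np

# Example values in the 0..1 range, as a regressor might return for a 0/1 target.
raw = np.array([0.54, 0.64, 0.21, 0.87, 0.01])

truncated = raw.astype(int)             # same effect as int(p): drops the fraction
thresholded = (raw >= 0.5).astype(int)  # 1 when the value is at least 0.5

print(truncated)    # [0 0 0 0 0]
print(thresholded)  # [1 1 0 1 0]
```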

Now compare the predictions from X_test to the actual answers in y_test

In [57]:
mae = mean_absolute_error(y_test, rounded_predictions) # true values first, then predictions

In [58]:
mae

0.3177083333333333

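Because every label and every converted prediction is either 0 or 1, each error is exactly 1, so an MAE of about 0.32 means that roughly 32% of the test predictions are wrong.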

Lastly, you can try different values of max_leaf_nodes, pick the one with the lowest error, and train the final model on all the data (X and y).
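
That search can be sketched as a loop over a few candidate values. The candidate list is arbitrary and the data is synthetic, so treat this as an outline rather than a tuned procedure.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; replace with the real X and y.
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=256)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=256)

def mae_for_leaf_nodes(max_leaf_nodes):
    """Fit a tree with the given size limit and return its MAE on the held-out rows."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=256)
    tree.fit(X_tr, y_tr)
    return mean_absolute_error(y_te, tree.predict(X_te))

candidates = [5, 10, 25, 50, 100]  # arbitrary values to try
scores = {n: mae_for_leaf_nodes(n) for n in candidates}
best_leaf_nodes = min(scores, key=scores.get)

# Refit on all rows once the setting is chosen.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_leaf_nodes, random_state=256)
final_model.fit(X_syn, y_syn)
print(scores, best_leaf_nodes)
```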

### Accuracy Metric
Check the percentage of correct predictions or the number of correct predictions.

In [60]:
from sklearn.metrics import accuracy_score
percentage_accuracy = accuracy_score(y_test, rounded_predictions)
num_of_accurate_predictions = accuracy_score(y_test, rounded_predictions, normalize=False)

In [61]:
percentage_accuracy # aka 131/192

0.6822916666666666

In [62]:
num_of_accurate_predictions

131

In [64]:
num_of_samples = len(y_test) # number of rows in the test set
num_of_samples

192
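
For 0/1 labels, accuracy and mean absolute error count the same mistakes from opposite sides: 131 of 192 predictions are correct, so 61 of 192 are wrong, and 61/192 is the 0.3177 reported as the MAE earlier. A small made-up example of that relationship:

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Small made-up example: 8 true labels and 8 predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

mae_demo = mean_absolute_error(y_true, y_pred)  # 3 mistakes out of 8
acc_demo = accuracy_score(y_true, y_pred)       # 5 correct out of 8

print(mae_demo, acc_demo)  # 0.375 0.625
```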

## 4. Random Forest Model
Import RandomForestRegressor and build a random forest model.
Everything else is done the same way, but a random forest contains many decision trees.
The predictions from the individual trees are averaged, which usually gives a better prediction than a single tree.
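
The averaging can be checked directly: a fitted forest exposes its trees in estimators_, and the mean of their predictions should match the forest's own output. A small sketch on synthetic data, with an arbitrary forest size of 10 trees:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 8 features and a 0/1 target.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=256)

small_forest = RandomForestRegressor(n_estimators=10, random_state=256)
small_forest.fit(X_syn, y_syn)

# Predict with each fitted tree separately, then average across trees.
per_tree = np.array([tree.predict(X_syn[:5]) for tree in small_forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(small_forest.predict(X_syn[:5]))
```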

In [65]:
from sklearn.ensemble import RandomForestRegressor

In [70]:
len(y_train)

575

In [72]:
forest = RandomForestRegressor(random_state=256)
forest.fit(X_train, y_train)

RandomForestRegressor(random_state=256)
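
Because y was built with data[y_columns], it is a one-column dataframe rather than a 1-D series. scikit-learn's forest estimators generally warn about that shape (a DataConversionWarning) and ask for a 1-D target. A minimal sketch of selecting the target as a series instead, on a made-up toy frame that reuses the column labels from above:

```python
import warnings
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import DataConversionWarning

# Toy frame with random values; only the column labels mirror the notebook.
rng = np.random.default_rng(256)
toy = pd.DataFrame({"6": rng.integers(0, 10, 40),
                    "148": rng.integers(60, 200, 40),
                    "1": rng.integers(0, 2, 40)})

X_toy = toy[["6", "148"]]
y_toy = toy["1"]  # single brackets give a 1-D Series, not a one-column DataFrame

with warnings.catch_warnings():
    # Turn the shape warning into an error so the check fails loudly if it appears.
    warnings.simplefilter("error", DataConversionWarning)
    toy_forest = RandomForestRegressor(n_estimators=10, random_state=256)
    toy_forest.fit(X_toy, y_toy)

print(y_toy.shape)
```

Calling .values.ravel() on the one-column dataframe is another common way to get the same 1-D shape.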

In [75]:
predictions = forest.predict(X_test)
# The predictions are not all exactly 0 or 1, so convert them to integers
# (note: int() truncates toward zero, so 0.87 becomes 0 instead of rounding to 1)
rounded_predictions = []
for p in predictions:
    rounded_predictions.append(int(p))
    
mae = mean_absolute_error(y_test, rounded_predictions)
mae

0.390625

In [76]:
# Controlling other values
# criterion='absolute_error' was called 'mae' in scikit-learn versions before 1.0
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
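
Variants like these only become useful once they are scored on the same held-out data. A sketch of such a comparison, using synthetic data and smaller forests (50 trees, an arbitrary choice to keep it quick):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the real X and y.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

candidate_models = {
    "min_samples_split=20": RandomForestRegressor(
        n_estimators=50, min_samples_split=20, random_state=0),
    "max_depth=7": RandomForestRegressor(
        n_estimators=50, max_depth=7, random_state=0),
}

# Fit every candidate on the same training rows and score it on the same test rows.
results = {}
for name, candidate in candidate_models.items():
    candidate.fit(X_tr, y_tr)
    results[name] = mean_absolute_error(y_te, candidate.predict(X_te))

# Lowest error first.
for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.3f}")
```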

## 5. Final Predictions to csv
With the final model, get the final predictions, put them in a dataframe, and save it as a CSV file.


In [79]:
predictions_final = forest.predict(X_test) # get predictions

In [81]:
predictions_final[:5]

array([0.54, 0.64, 0.21, 0.87, 0.01])

In [85]:
output = pd.DataFrame({'Id': X_test.index, 'Diabetic': predictions_final}) # save to new dataframe with indexes

In [86]:
output.to_csv('submission.csv', index=False)

The CSV file will then appear in the working directory.
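
The values saved above are the forest's raw outputs (for example 0.54), not 0/1 labels. If labels are wanted in the file, they can be thresholded first. The sketch below assumes a 0.5 cut-off and made-up row ids, and writes to a temporary directory so that it can read the file back as a check.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Made-up row ids paired with the first five predictions printed above.
row_ids = [10, 11, 12, 13, 14]
raw_predictions = np.array([0.54, 0.64, 0.21, 0.87, 0.01])

submission = pd.DataFrame({
    "Id": row_ids,
    "Diabetic": (raw_predictions >= 0.5).astype(int),  # assumed 0.5 cut-off
})

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    submission.to_csv(path, index=False)
    check = pd.read_csv(path)

print(check)
```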