<a href="https://colab.research.google.com/github/justin-hsieh/DS-Unit-2-Applied-Modeling/blob/master/assignment_applied_modeling_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model, and which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [ ] Begin to clean and explore your data.
- [ ] Choose which features, if any, to exclude. Would some features "leak" information from the future?

## Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), **by Lambda DS3 student** Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

In [0]:
import pandas as pd
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score)

In [13]:
df = pd.read_csv('openpowerlifting.csv')

# Drop the individual attempt columns (including fourth attempts),
# plus Country and Place; the Best3* columns are kept.
drops = ['Squat4Kg', 'Bench4Kg', 'Deadlift4Kg', 'Country', 'Place',
         'Squat1Kg', 'Squat2Kg', 'Squat3Kg',
         'Bench1Kg', 'Bench2Kg', 'Bench3Kg',
         'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg']
df = df.drop(columns=drops)
#df.dropna(inplace=True)
df.shape

(441706, 23)

In [5]:
df.head()

Unnamed: 0,Name,Sex,Event,Equipment,Age,AgeClass,Division,BodyweightKg,WeightClassKg,Best3SquatKg,Best3BenchKg,Best3DeadliftKg,TotalKg,Wilks,McCulloch,Glossbrenner,IPFPoints,Tested,Federation,Date,MeetCountry,MeetState,MeetName
8164,Albert Kocikowski,M,SBD,Raw,16.0,16-17,Youth A,93.8,105,150.0,107.5,140.0,397.5,248.7,281.03,239.44,335.57,Yes,BVDK,2018-09-08,Germany,MV,LM Mecklenburg-Vorpommern KDK
8165,Hamidreza Jadali,M,SBD,Raw,20.5,20-23,Juniors,69.5,74,130.0,90.0,145.0,365.0,275.04,280.54,266.55,379.34,Yes,BVDK,2018-09-08,Germany,MV,LM Mecklenburg-Vorpommern KDK
8166,Rustam Hadschi,M,SBD,Raw,18.0,18-19,Youth A,69.4,74,155.0,100.0,175.0,430.0,324.38,343.84,314.38,462.54,Yes,BVDK,2018-09-08,Germany,MV,LM Mecklenburg-Vorpommern KDK
8167,Tim Hajo Paltinat,M,SBD,Raw,17.5,18-19,Youth A,67.8,74,107.5,80.0,135.0,322.5,247.77,262.64,240.3,332.26,Yes,BVDK,2018-09-08,Germany,MV,LM Mecklenburg-Vorpommern KDK
8169,Jan Schnoor,M,SBD,Raw,25.0,24-34,Open,125.3,120+,255.0,172.5,275.0,702.5,400.12,400.12,381.08,554.12,Yes,BVDK,2018-09-08,Germany,MV,LM Mecklenburg-Vorpommern KDK
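
Since each row carries a meet `Date`, one option for the train/validate/test task is a time-based split, so the model is always scored on meets later than the ones it was trained on. The sketch below is an assumption, not a choice made in this notebook: the helper name `split_by_date`, the cutoff dates, and the toy rows are all hypothetical.

```python
import pandas as pd

def split_by_date(frame, val_start, test_start, date_col="Date"):
    """Split rows into train/validate/test by date cutoffs (hypothetical helper)."""
    dates = pd.to_datetime(frame[date_col], errors="coerce")
    val_start, test_start = pd.Timestamp(val_start), pd.Timestamp(test_start)
    # Rows with a missing or unparseable date compare False everywhere,
    # so they end up in none of the three splits.
    train = frame[dates < val_start]
    val = frame[(dates >= val_start) & (dates < test_start)]
    test = frame[dates >= test_start]
    return train, val, test

# Toy rows standing in for the real data; the cutoffs are illustrative only.
toy = pd.DataFrame({
    "Date": ["2016-05-01", "2017-03-12", "2018-01-20", "2018-09-08"],
    "Best3SquatKg": [120.0, 150.0, 155.0, 255.0],
})
train, val, test = split_by_date(toy, "2017-01-01", "2018-01-01")
print(len(train), len(val), len(test))
```

A random split would also work, but it lets results from the same meet land in both train and test.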


In [14]:
df.isna().sum()

Name                    0
Sex                     0
Event                   0
Equipment               0
Age                298228
AgeClass           291034
Division             5496
BodyweightKg         9889
WeightClassKg        6115
Best3SquatKg        94886
Best3BenchKg        47395
Best3DeadliftKg     92078
TotalKg             40918
Wilks               46623
McCulloch           46744
Glossbrenner        46624
IPFPoints           53512
Tested              87181
Federation              1
Date                    1
MeetCountry             1
MeetState           95374
MeetName                1
dtype: int64
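
These counts are easier to compare as fractions of all rows, and rows with a missing target cannot be used for supervised training. A minimal sketch on toy data, assuming `Best3SquatKg` is the target (the notebook has not fixed one yet); the helper `drop_missing_target` is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_missing_target(frame, target):
    """Keep only rows where the target is present; other NaNs are left alone."""
    return frame.dropna(subset=[target])

toy = pd.DataFrame({
    "Age": [16.0, np.nan, 18.0, np.nan],
    "Best3SquatKg": [150.0, 130.0, np.nan, 107.5],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

labeled = drop_missing_target(toy, "Best3SquatKg")
print(labeled.shape)
```

Dropping on the target alone keeps far more data than a blanket `df.dropna()`, which would discard at least the 298,228 rows that have no `Age`.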

In [6]:
df.columns

Index(['Name', 'Sex', 'Event', 'Equipment', 'Age', 'AgeClass', 'Division',
       'BodyweightKg', 'WeightClassKg', 'Best3SquatKg', 'Best3BenchKg',
       'Best3DeadliftKg', 'TotalKg', 'Wilks', 'McCulloch', 'Glossbrenner',
       'IPFPoints', 'Tested', 'Federation', 'Date', 'MeetCountry', 'MeetState',
       'MeetName'],
      dtype='object')
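
Some of these columns are derived from the lifts themselves. In the rows shown by `df.head()`, `TotalKg` equals the sum of the three `Best3` lifts, and `Wilks`, `McCulloch`, `Glossbrenner`, and `IPFPoints` are scoring formulas based on the total. If the target is one of the lifts, those columns would leak it. The sketch below assumes `Best3SquatKg` as the target; `TARGET`, `LEAKY`, and `select_features` are hypothetical names.

```python
import pandas as pd

# Assumption: the target is one of the Best3 lift columns.
TARGET = "Best3SquatKg"

# Columns computed from the lifts; keeping them would leak the target.
LEAKY = ["TotalKg", "Wilks", "McCulloch", "Glossbrenner", "IPFPoints"]

def select_features(frame, target=TARGET, leaky=LEAKY):
    """Return (X, y) with the target and any leaky columns removed from X."""
    to_drop = [target] + [col for col in leaky if col in frame.columns]
    return frame.drop(columns=to_drop), frame[target]

# Toy frame built from the first two rows of df.head().
toy = pd.DataFrame({
    "BodyweightKg": [93.8, 69.5],
    "Best3SquatKg": [150.0, 130.0],
    "Best3BenchKg": [107.5, 90.0],
    "Best3DeadliftKg": [140.0, 145.0],
    "TotalKg": [397.5, 365.0],
    "Wilks": [248.7, 275.04],
})
X, y = select_features(toy)
print(list(X.columns))
```

Whether the other two best lifts are fair features depends on when the prediction would be made, since they come from the same meet as the target.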

In [7]:
df['Equipment'].value_counts()

Single-ply    17572
Raw           15832
Wraps          1456
Multi-ply       120
Name: Equipment, dtype: int64

In [8]:
df.Sex.value_counts()

M    27310
F     7670
Name: Sex, dtype: int64
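
For a classification target, the majority class baseline from the assignment predicts the most frequent class for every row. As an illustration only (the baselines in this notebook treat the lifts as regression targets), this sketch rebuilds a label column from the `Sex` counts printed above and scores that baseline with `accuracy_score`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Rebuild a label column from the printed counts (M: 27310, F: 7670).
y = pd.Series(["M"] * 27310 + ["F"] * 7670, name="Sex")

majority_class = y.mode()[0]          # most frequent label
y_pred = [majority_class] * len(y)    # predict it for every row

print(majority_class, round(accuracy_score(y, y_pred), 4))
```

The accuracy equals the share of the majority class, which is the number any classifier on this target would need to beat.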

In [9]:
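# Mean baseline for squat: predict the column mean for every row.
# sklearn's metrics raise a ValueError on NaN, so rows with a missing
# squat must be dropped before this cell runs.
squat = df['Best3SquatKg']
squat_mean = [squat.mean()] * len(squat)
print(mean_absolute_error(squat, squat_mean))  # MAE in kg
print(r2_score(squat, squat_mean))             # 0 up to float rounding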

49.470480585737015
1.1102230246251565e-16


In [10]:
# Mean baseline for bench press (requires a NaN-free column).
bench = df['Best3BenchKg']
bench_mean = [bench.mean()] * len(bench)
print(mean_absolute_error(bench, bench_mean))  # MAE in kg
print(r2_score(bench, bench_mean))

34.92905712984417
0.0


In [11]:
# Mean baseline for deadlift (requires a NaN-free column).
deadlift = df['Best3DeadliftKg']
deadlift_mean = [deadlift.mean()] * len(deadlift)
print(mean_absolute_error(deadlift, deadlift_mean))  # MAE in kg
print(r2_score(deadlift, deadlift_mean))

45.47948744070369
0.0
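
The three baseline cells score the mean on the same rows used to compute it. A stricter variant, sketched here under assumptions, computes the mean on training rows only and scores it on held-out rows. The helper `mean_baseline` and the toy numbers are hypothetical.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

def mean_baseline(y_train, y_val):
    """Predict the training mean for every validation row; return (MAE, R^2)."""
    y_train, y_val = y_train.dropna(), y_val.dropna()
    y_pred = [y_train.mean()] * len(y_val)
    return mean_absolute_error(y_val, y_pred), r2_score(y_val, y_pred)

# Toy train/validation frames; None becomes NaN and is dropped by the helper.
train = pd.DataFrame({"Best3SquatKg": [150.0, 130.0, 155.0, None],
                      "Best3BenchKg": [107.5, 90.0, 100.0, 80.0]})
val = pd.DataFrame({"Best3SquatKg": [107.5, 255.0],
                    "Best3BenchKg": [80.0, 172.5]})

for lift in ["Best3SquatKg", "Best3BenchKg"]:
    mae, r2 = mean_baseline(train[lift], val[lift])
    print(f"{lift}: MAE={mae:.2f} kg, R^2={r2:.3f}")
```

On held-out rows, R^2 for a constant prediction is at most 0, so a slightly negative value here is expected rather than a sign of a bug.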
