<a href="https://colab.research.google.com/github/Mike-Xie/DS-Unit-2-Applied-Modeling/blob/master/module2/Mike_Xie_assignment_applied_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
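
The two target-distribution checks in the list above can be sketched as follows. This is a minimal illustration, not part of the assignment: the `labels` and `prices` series are made-up data, and `majority_class_frequency` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def majority_class_frequency(target: pd.Series) -> float:
    """Share of rows that belong to the most common class."""
    # value_counts sorts by frequency, so the first entry is the majority class
    return float(target.value_counts(normalize=True).iloc[0])


# Hypothetical classification target (illustration only)
labels = pd.Series(['win', 'win', 'win', 'loss'])
freq = majority_class_frequency(labels)

# Rule of thumb from the checklist: accuracy alone is fine in [50%, 70%)
accuracy_alone_is_ok = 0.5 <= freq < 0.7

# Hypothetical right-skewed regression target drawn from a lognormal
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13, sigma=1, size=1000))

# Compare skewness before and after a log transform
skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()

print(f'majority class frequency: {freq:.2f}')
print(f'accuracy alone is ok: {accuracy_alone_is_ok}')
print(f'skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}')
```

If the skew drops sharply after `np.log1p`, fitting on the log-transformed target and converting predictions back with `np.expm1` is one option to consider.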

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---

# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?
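
One way to answer the "just by guessing" question for accuracy is a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on a made-up target; `y_example` and `X_placeholder` are hypothetical names, and the features are all zeros because this baseline ignores them.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical binary target (illustration only)
y_example = pd.Series([1, 0, 1, 1, 0, 1])

# DummyClassifier ignores the features, so a column of zeros is enough
X_placeholder = np.zeros((len(y_example), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_placeholder, y_example)

# score() returns accuracy for classifiers
baseline_accuracy = baseline.score(X_placeholder, y_example)
print(f'baseline accuracy: {baseline_accuracy:.3f}')
```

Any first model should be compared against this number on the same validation rows.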

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below, to just keep a subset for the Tribeca neighborhood, and remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

In [0]:
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

In [1]:
!wget https://github.com/bowswung/voobly-scraper/raw/master/data/MatchData/20190208/matchDump.csv.zip

--2019-12-18 00:00:59--  https://github.com/bowswung/voobly-scraper/raw/master/data/MatchData/20190208/matchDump.csv.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/bowswung/voobly-scraper/master/data/MatchData/20190208/matchDump.csv.zip [following]
--2019-12-18 00:00:59--  https://media.githubusercontent.com/media/bowswung/voobly-scraper/master/data/MatchData/20190208/matchDump.csv.zip
Resolving media.githubusercontent.com (media.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49029198 (47M) [application/zip]
Saving to: ‘matchDump.csv.zip’


2019-12-18 00:01:01 (108 MB/s) - ‘matchDump.csv.zip’ saved [49029198/49029198]



In [2]:
!unzip matchDump.csv.zip

Archive:  matchDump.csv.zip
  inflating: matchDump.csv           


In [3]:
!head matchDump.csv

MatchId,MatchUrl,MatchDate,MatchDuration,MatchLadder,MatchMap,MatchMods,MatchPlayerId,MatchPlayerName,MatchPlayerTeam,MatchPlayerCivId,MatchPlayerCivName,MatchPlayerWinner,MatchPlayerPreRating,MatchPlayerPostRating,MatchPlayerRecording
17827685,https://www.voobly.com/match/view/17827685,2018-05-27T18:14:00,520,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,123972487,___Oreo,1,9,Saracens,1,1584,1600,
17827685,https://www.voobly.com/match/view/17827685,2018-05-27T18:14:00,520,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,125049367,___Dm,2,9,Saracens,0,1616,1600,
17827728,https://www.voobly.com/match/view/17827728,2018-05-27T18:21:00,661,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,123972487,___Oreo,1,13,Celts,1,1584,1600,
17827728,https://www.voobly.com/match/view/17827728,2018-05-27T18:21:00,661,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,125049367,___Dm,2,13,Celts,0,1616,1600,
17832641,https://www.voobly.com/match/view/17832641,2018-05-28T11:25:00

In [4]:
import pandas as pd
pd.options.display.max_rows = 999
pd.options.display.max_columns = 100

df = pd.read_csv('matchDump.csv', header=0, engine='python')
df.shape


(1263808, 16)

In [5]:
df.head(30)

,MatchId,MatchUrl,MatchDate,MatchDuration,MatchLadder,MatchMap,MatchMods,MatchPlayerId,MatchPlayerName,MatchPlayerTeam,MatchPlayerCivId,MatchPlayerCivName,MatchPlayerWinner,MatchPlayerPreRating,MatchPlayerPostRating,MatchPlayerRecording
0,17827685,https://www.voobly.com/match/view/17827685,2018-05-27T18:14:00,520,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,123972487,___Oreo,1,9,Saracens,1,1584,1600,
1,17827685,https://www.voobly.com/match/view/17827685,2018-05-27T18:14:00,520,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,125049367,___Dm,2,9,Saracens,0,1616,1600,
2,17827728,https://www.voobly.com/match/view/17827728,2018-05-27T18:21:00,661,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,123972487,___Oreo,1,13,Celts,1,1584,1600,
3,17827728,https://www.voobly.com/match/view/17827728,2018-05-27T18:21:00,661,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,125049367,___Dm,2,13,Celts,0,1616,1600,
4,17832641,https://www.voobly.com/match/view/17832641,2018-05-28T11:25:00,762,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,124815560,SOOR_DARA,1,14,Spanish,1,1584,1600,
5,17832641,https://www.voobly.com/match/view/17832641,2018-05-28T11:25:00,762,Match Stats Only,=V= CBA Hero AC V13.scx,v1.5 Beta R6,124693469,danyal__,2,14,Spanish,0,1616,1600,
6,17839373,https://www.voobly.com/match/view/17839373,2018-05-29T17:04:00,1073,Match Stats Only,=V= TDII TeamBonus v3.scx,"v1.5 Beta R6, TDII TeamBonus",125043437,cRiukc,1,17,Huns,1,1584,1600,
7,17839373,https://www.voobly.com/match/view/17839373,2018-05-29T17:04:00,1073,Match Stats Only,=V= TDII TeamBonus v3.scx,"v1.5 Beta R6, TDII TeamBonus",124157048,_SickBoy,2,16,Mayans,0,1616,1600,
8,17839373,https://www.voobly.com/match/view/17839373,2018-05-29T17:04:00,1073,Match Stats Only,=V= TDII TeamBonus v3.scx,"v1.5 Beta R6, TDII TeamBonus",*VooblyErrorPlayerNotFound*,"<td valign=""bottom"">Atabeg Zangi (Computer)</td>",*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*
9,17839373,https://www.voobly.com/match/view/17839373,2018-05-29T17:04:00,1073,Match Stats Only,=V= TDII TeamBonus v3.scx,"v1.5 Beta R6, TDII TeamBonus",*VooblyErrorPlayerNotFound*,"<td valign=""bottom"">Rey Alarico II (Computer)<...",*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*,*VooblyErrorPlayerNotFound*


# Choose Target

In [0]:
# I want to predict which player wins

target = 'MatchPlayerWinner'

In [7]:
df[target].describe()

count     1263808
unique          3
top             0
freq       647589
Name: MatchPlayerWinner, dtype: object

In [8]:
df['MatchLadder'].value_counts()

RM - Team           723786
RM - 1v1            411718
Match Stats Only     81083
DM - Team            40249
DM - 1v1              6972
Name: MatchLadder, dtype: int64

In [9]:
rm_1v1 = df[df['MatchLadder'] == 'RM - 1v1']

rm_1v1.shape

(411718, 16)

In [10]:
rm_1v1.head()

,MatchId,MatchUrl,MatchDate,MatchDuration,MatchLadder,MatchMap,MatchMods,MatchPlayerId,MatchPlayerName,MatchPlayerTeam,MatchPlayerCivId,MatchPlayerCivName,MatchPlayerWinner,MatchPlayerPreRating,MatchPlayerPostRating,MatchPlayerRecording
728,18658367,https://www.voobly.com/match/view/18658367,2018-10-26T14:19:00,1055,RM - 1v1,Arabia,"v1.5 Beta R7, WololoKingdoms",125152121,Tuborger,1,16,Mayans,1,1599,1615,
729,18658367,https://www.voobly.com/match/view/18658367,2018-10-26T14:19:00,1055,RM - 1v1,Arabia,"v1.5 Beta R7, WololoKingdoms",125151302,Sarah1409,2,11,Vikings,0,1600,1584,
730,18658419,https://www.voobly.com/match/view/18658419,2018-10-26T14:38:00,3361,RM - 1v1,Custom,1.4 RC,123628860,PL_Hitorek__,1,18,Koreans,1,1629,1642,
731,18658419,https://www.voobly.com/match/view/18658419,2018-10-26T14:38:00,3361,RM - 1v1,Custom,1.4 RC,123805002,n30n,2,17,Huns,0,1548,1535,
732,18658476,https://www.voobly.com/match/view/18658476,2018-10-26T14:55:00,3369,RM - 1v1,Nomad,"v1.5 Beta R7, WololoKingdoms",123834564,ppwudi,1,6,Chinese,1,1975,1995,


In [11]:
rm_1v1.tail()

,MatchId,MatchUrl,MatchDate,MatchDuration,MatchLadder,MatchMap,MatchMods,MatchPlayerId,MatchPlayerName,MatchPlayerTeam,MatchPlayerCivId,MatchPlayerCivName,MatchPlayerWinner,MatchPlayerPreRating,MatchPlayerPostRating,MatchPlayerRecording
1263771,19250562,https://www.voobly.com/match/view/19250562,2019-02-07T19:44:00,4797,RM - 1v1,Arabia,"v1.5 Beta R7, WololoKingdoms",123904589,YYY___YYY,2,8,Persians,0,1695,1678,https://www.voobly.com/files/view/51848193/82v...
1263772,19250571,https://www.voobly.com/match/view/19250571,2019-02-07T19:45:00,2501,RM - 1v1,Custom,"v1.5 Beta R7, WololoKingdoms",125198457,killpigboy,1,13,Celts,1,1933,1948,https://www.voobly.com/files/view/51848229/e9x...
1263773,19250571,https://www.voobly.com/match/view/19250571,2019-02-07T19:45:00,2501,RM - 1v1,Custom,"v1.5 Beta R7, WololoKingdoms",124772915,BarT,2,12,Mongols,0,1909,1894,https://www.voobly.com/files/view/51848226/2cx...
1263774,19250581,https://www.voobly.com/match/view/19250581,2019-02-07T19:47:00,4080,RM - 1v1,Custom,"v1.5 Beta R7, WololoKingdoms",156947,Krok_,1,24,Portuguese,1,1542,1555,https://www.voobly.com/files/view/51848307/pva...
1263775,19250581,https://www.voobly.com/match/view/19250581,2019-02-07T19:47:00,4080,RM - 1v1,Custom,"v1.5 Beta R7, WololoKingdoms",125214232,Thendo,2,19,Italians,0,1479,1466,https://www.voobly.com/files/view/51848258/uwe...


In [12]:
y = rm_1v1[target]

y.nunique()

3

In [13]:
y.value_counts() # baseline is 50/50 since only P1 or P2 can win a 1v1

1                              205816
0                              205816
*VooblyErrorPlayerNotFound*        86
Name: MatchPlayerWinner, dtype: int64

In [14]:
z = df[target]

z.value_counts()

0                              647589
1                              613547
*VooblyErrorPlayerNotFound*      2672
Name: MatchPlayerWinner, dtype: int64

In [15]:
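rm_1v1['MatchPlayerPostRating'].isnull().sum()
# Plan: manually fix the 86 placeholder targets, since winners gain rating and losers lose rating.
# Caveat: isnull() only counts NaN and the placeholder is a string, so a count of 0 does not show the ratings are usable.

# TODO: do it tomorrow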

0
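
The repair idea above can be sketched on a tiny made-up frame. Everything here is an assumption for illustration: the `toy` rows are hypothetical, a rating change of zero is treated as a loss, and rows whose ratings are also placeholders are simply dropped.

```python
import pandas as pd

ERROR_TOKEN = '*VooblyErrorPlayerNotFound*'

# Hypothetical miniature of rm_1v1 (illustration only)
toy = pd.DataFrame({
    'MatchPlayerWinner':     ['1', '0', ERROR_TOKEN, ERROR_TOKEN],
    'MatchPlayerPreRating':  ['1599', '1600', '1629', ERROR_TOKEN],
    'MatchPlayerPostRating': ['1615', '1584', '1642', ERROR_TOKEN],
})

# Turn the string placeholder into NaN so missing values become visible
cleaned = toy.copy()
for col in cleaned.columns:
    cleaned[col] = pd.to_numeric(cleaned[col], errors='coerce')

# Winners gain rating, so a positive change implies a win;
# stays NaN where either rating is missing
rating_change = cleaned['MatchPlayerPostRating'] - cleaned['MatchPlayerPreRating']
inferred_winner = (rating_change > 0).astype(float).where(rating_change.notna())

# Fill missing targets where possible, drop the rest, and cast to int
cleaned['MatchPlayerWinner'] = cleaned['MatchPlayerWinner'].fillna(inferred_winner)
cleaned = (
    cleaned
    .dropna(subset=['MatchPlayerWinner'])
    .astype({'MatchPlayerWinner': int})
)

print(cleaned)
```

On the real data, checking `isna().sum()` after the `pd.to_numeric` step would show how many of the 86 rows can actually be repaired this way.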

In [16]:
rm_1v1.columns

Index(['MatchId', 'MatchUrl', 'MatchDate', 'MatchDuration', 'MatchLadder',
       'MatchMap', 'MatchMods', 'MatchPlayerId', 'MatchPlayerName',
       'MatchPlayerTeam', 'MatchPlayerCivId', 'MatchPlayerCivName',
       'MatchPlayerWinner', 'MatchPlayerPreRating', 'MatchPlayerPostRating',
       'MatchPlayerRecording'],
      dtype='object')

In [0]:
# MatchPlayerPostRating leaks the result, so we probably need to drop it after using it to fix the *VooblyErrorPlayerNotFound* targets
# MatchId, MatchUrl, and MatchPlayerRecording are probably useless as features
# MatchPlayerCivName and MatchPlayerCivId are 1:1 (perfectly collinear), so one of them needs to go
# never mind, just drop
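
A small sketch of the checks described in these comments, using two rows copied from the `rm_1v1.head()` output. The name `leaky_or_useless` is hypothetical, and keeping `MatchId` is an assumption on the grounds that it may still be useful for pairing or grouping rows even if it is not a model feature.

```python
import pandas as pd

# Two-row sample with the notebook's column names
toy = pd.DataFrame({
    'MatchId': [18658367, 18658367],
    'MatchUrl': ['https://www.voobly.com/match/view/18658367'] * 2,
    'MatchPlayerCivId': [16, 11],
    'MatchPlayerCivName': ['Mayans', 'Vikings'],
    'MatchPlayerPreRating': [1599, 1600],
    'MatchPlayerPostRating': [1615, 1584],
    'MatchPlayerRecording': [None, None],
    'MatchPlayerWinner': [1, 0],
})

# Verify the 1:1 claim: each civ ID should map to exactly one civ name
names_per_id = toy.groupby('MatchPlayerCivId')['MatchPlayerCivName'].nunique()
civ_is_one_to_one = bool((names_per_id == 1).all())

# Post-match rating leaks the outcome; the others carry no obvious signal
leaky_or_useless = [
    'MatchPlayerPostRating',
    'MatchUrl',
    'MatchPlayerRecording',
    'MatchPlayerCivName',
]
features_only = toy.drop(columns=leaky_or_useless)

print(civ_is_one_to_one)
print(list(features_only.columns))
```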


# Feature Engineering To-Do List

In [0]:
# do this later: engineer a relative-rating feature by getting the P1 and P2 ratings on the same row after a merge

evens = df.iloc[0::2]  # rows at even positions (0, 2, 4, ...)
odds = df.iloc[1::2]   # rows at odd positions (1, 3, 5, ...)

# add a feature for the tier-list ranking of the player's civ

# add some features for whether the civ has good:
# eco, cavalry, archers, siege, etc., using some source somewhere
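
The relative-rating idea can also be sketched as a self-merge on `MatchId`, which does not depend on the two players' rows alternating in the file. This is one possible approach, not the notebook's implemented method; the `toy` frame, `OpponentId`, `OpponentPreRating`, and `RatingDiff` are hypothetical names, and the sketch assumes exactly two distinct players per match.

```python
import pandas as pd

# Hypothetical miniature of the 1v1 data: two rows per match
toy = pd.DataFrame({
    'MatchId': [1, 1, 2, 2],
    'MatchPlayerId': [10, 11, 12, 13],
    'MatchPlayerPreRating': [1599, 1600, 1629, 1548],
    'MatchPlayerWinner': [1, 0, 1, 0],
})

# Build an "opponent view" of the same rows with renamed columns
opponents = toy[['MatchId', 'MatchPlayerId', 'MatchPlayerPreRating']].rename(
    columns={
        'MatchPlayerId': 'OpponentId',
        'MatchPlayerPreRating': 'OpponentPreRating',
    }
)

# Self-merge pairs every player with every player in the same match,
# including themselves, so drop the self-pairs afterwards
paired = toy.merge(opponents, on='MatchId')
paired = paired[paired['MatchPlayerId'] != paired['OpponentId']].reset_index(drop=True)

# Positive values mean the player was rated higher than the opponent
paired['RatingDiff'] = paired['MatchPlayerPreRating'] - paired['OpponentPreRating']

print(paired[['MatchId', 'MatchPlayerId', 'OpponentId', 'RatingDiff']])
```

The ratings would need to be numeric first, so any placeholder strings have to be handled before this step.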

# Fit Model

In [20]:
rm_1v1.describe()

from sklearn.model_selection import train_test_split

train, val = train_test_split(rm_1v1, train_size = 0.8, test_size = 0.2)

train.shape, val.shape

((329374, 16), (82344, 16))
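
A purely random row split can put one player's row from a match in train and the other player's row in validation. As a hedged alternative, the sketch below keeps both rows of a match together with scikit-learn's `GroupShuffleSplit`, grouping on `MatchId`. The `toy` frame is made up, and whether this changes the scores on the real data is untested here.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical miniature: five matches, two rows each
toy = pd.DataFrame({
    'MatchId': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'MatchPlayerWinner': [1, 0] * 5,
})

# Split on whole matches rather than individual rows
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, val_idx = next(splitter.split(toy, groups=toy['MatchId']))

train_toy = toy.iloc[train_idx]
val_toy = toy.iloc[val_idx]

# No match should appear on both sides of the split
shared_matches = set(train_toy['MatchId']) & set(val_toy['MatchId'])

print(train_toy.shape, val_toy.shape, shared_matches)
```

Since the data has a `MatchDate` column, a time-based split (train on earlier matches, validate on later ones) is another option worth comparing.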

In [0]:
# try to fit a model that beats the baseline
target = 'MatchPlayerWinner'
features = ['MatchDuration', 'MatchDate','MatchMap','MatchMods','MatchPlayerCivId','MatchPlayerPreRating']

# this feature set looks sparse; it will need a lot of feature engineering later
X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]


In [0]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [30]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    RandomForestClassifier(n_estimators = 100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)

print('Train Accuracy', pipeline.score(X_train, y_train))

# well, that's not good: train accuracy this close to 1.0 suggests the forest memorized the training rows

Train Accuracy 0.9998239083837825


In [31]:
print('Validation Accuracy', pipeline.score(X_val, y_val))

# fail: validation accuracy is no better than the 50/50 baseline

Validation Accuracy 0.4987612940833576
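
One plausible reason for the result above is that none of the features describe the opponent: the match-level columns are identical for both players of a match, while the two rows carry opposite labels. The sketch below illustrates the point on synthetic data only. The ratings are random numbers, the Elo-style win probability is an assumption, and nothing here is measured on the Voobly dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_matches = 2000

# Synthetic ratings, NOT taken from the Voobly data
p1_rating = rng.normal(1600, 200, n_matches)
p2_rating = rng.normal(1600, 200, n_matches)

# Assumed Elo-style probability that player 1 wins
p1_win_prob = 1 / (1 + 10 ** ((p2_rating - p1_rating) / 400))
p1_wins = (rng.random(n_matches) < p1_win_prob).astype(int)

# Two rows per match, one per player, mirroring the notebook's layout
match_ids = np.arange(n_matches)
rows = pd.DataFrame({
    'MatchId': np.concatenate([match_ids, match_ids]),
    'RatingDiff': np.concatenate([p1_rating - p2_rating, p2_rating - p1_rating]),
    'Winner': np.concatenate([p1_wins, 1 - p1_wins]),
})

# Keep both rows of a match on the same side of the split
is_train = rows['MatchId'] < 1600
train_rows, val_rows = rows[is_train], rows[~is_train]

# A single opponent-aware feature with a simple linear model
model = LogisticRegression()
model.fit(train_rows[['RatingDiff']], train_rows['Winner'])
val_accuracy = model.score(val_rows[['RatingDiff']], val_rows['Winner'])

print(f'validation accuracy on synthetic data: {val_accuracy:.3f}')
```

If a rating-difference feature behaves similarly on the real matches, it would be a natural first candidate for beating the 50% baseline.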
