# Welcome to the Data Science Gym!
<a href="https://github.com/dskarbrevik/Data-Science-Gym">GitHub Repo</a>


### Workout Overview

**Workout ID:** DSG1ML1

**Workout type:** Machine Learning

**Workout data:** Iris Flower Dataset

**Workout difficulty:** &#11088;

### Who are you?

**Username:** David Skarbrevik

**Date you're doing this workout:** 2/20/2018

<br>

## Today's Workout Goal:

<br>

You are given data on three types of flowers. Your goal is to **build a model that can classify these three types of flowers.** That's it!



You get exactly 3 hours to come up with your best model. This includes 

<br>

**More info on dataset:** <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris Dataset</a>

**Data already cleaned?:** Yes

**Data size large? (>1GB):** No

**Overall Difficulty:** 

***

<a id="toc"></a>

## Table of Contents

<br>

<ol>
    <li><a href="#section1">Loading the data</a></li>
    <br>
    <li><a href="#section2">Exploratory Data Analysis</a></li>
    <br>
    <li><a href="#section3">Modeling</a></li> 
    <br>
    <li><a href="#section4">Results</a></li>             
</ol>

***

Before we get into any code, we'll load some base libraries that we'll make regular use of. Other libraries will be imported as needed in the code below:
    

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from inspect import getmembers
from collections import OrderedDict, Counter
import numbers

***

<a id='section1_1'></a>

## Introduction to the Dataset

This is a VERY small dataset of 150 observations, where each row corresponds to one of three types of flowers. Each observation/row has only five columns: the flower type (the label) and four measurements (sepal length, sepal width, petal length, petal width). It is one of the canonical "Hello World"s of machine learning.

<a id='section1_2'></a>

## EDA on the dataset

Importing this dataset from the sklearn package gives us the data in array form. To better explore it, however, it is generally useful to convert it to a Pandas DataFrame, so let's do that.

In [2]:
iris = datasets.load_iris() # data imported from sklearn

In [3]:
# convert numerical categories to their actual flower names
flower_type = [iris.target_names[i] for i in iris.target]

# create an ordered dictionary with all our features
iris_dict = OrderedDict([('Flower Type', flower_type), ('Sepal Length', iris.data[:, 0]), ('Sepal Width', iris.data[:, 1]), 
                         ('Petal Length', iris.data[:, 2]), ('Petal Width', iris.data[:, 3])])

# quickly verify that all arrays are the same length
for key in iris_dict.keys():
    print("{0}: {1}".format(key, len(iris_dict[key])))

# make Pandas DataFrame from ordered dictionary
iris_df = pd.DataFrame.from_dict(iris_dict)

Flower Type: 150
Sepal Length: 150
Sepal Width: 150
Petal Length: 150
Petal Width: 150


In [4]:
print("This dataset has {} observations (rows) and {} features (columns).".format(iris_df.shape[0], iris_df.shape[1]))

This dataset has 150 observations (rows) and 5 features (columns).


In [5]:
iris_df.head()

| | Flower Type | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|---|
| 0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 |


In [6]:
iris_df.describe()

| | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


Now that we have our data in a dataframe, we see there are quick functions we can run on it to get some basic information. 

Note that the first few rows are all "setosa". This is because the dataset is ordered: the first 50 rows are setosa, the next 50 are versicolor, and the last 50 are virginica. This matters because, when we model the data, we want to train on a random subset of it (rather than the first 100 rows, which would leave a model able to classify only 2 of our 3 types of flowers).
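The ordering pitfall can be checked directly. The sketch below reloads the data so it stands on its own; the seed `random_state=0` is an arbitrary choice made for this illustration, not a value used elsewhere in this workout.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
names = iris.target_names[iris.target]  # numeric codes -> flower names

# A naive "first 100 rows" training set never sees virginica.
naive_train = Counter(names[:100])
print(naive_train)

# train_test_split shuffles by default, so all three classes show up.
# random_state=0 is an arbitrary seed chosen for this sketch.
train_names, test_names = train_test_split(names, test_size=0.20, random_state=0)
shuffled_train = Counter(train_names)
print(shuffled_train)
```

Because `train_test_split` shuffles before splitting, a plain call is enough here; its `stratify` argument is an option if the class proportions should also be preserved exactly.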

Also, the basic statistics show that petal length has a particularly high standard deviation relative to the other features, so it may be the most valuable feature for differentiating one or more types of flowers (just a hunch, not a certainty).
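One hedged way to probe that hunch is to look at the per-class means next to the overall spread: a large standard deviation is only useful if it comes from differences between the flower types. This sketch rebuilds a small DataFrame from sklearn's own feature names (so the column labels differ from `iris_df`), and the variable names are chosen here for illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
species = pd.Series(iris.target_names[iris.target], name="species")

# Overall spread of each feature (the same "std" row that describe() reports).
overall_std = features.std()
print(overall_std.sort_values(ascending=False))

# Mean of each feature within each flower type; well-separated means
# suggest the feature helps tell the types apart.
class_means = features.groupby(species).mean()
print(class_means)
```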

## Modeling

Some possible modeling techniques for a discrete classification problem like this:
* logistic regression
* Naive Bayes
* K-NN
* Decision Tree

But first, let's split our data into training and testing groups so we can evaluate our models later...

#### Data split

In [12]:
# prepping our data to train models
from sklearn.model_selection import train_test_split

data = iris.data[:,:]
labels = flower_type
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=.20, random_state=25)

#### k-NN model

In [19]:
# training a k-NN classifier and evaluating it

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

n_neighbors = 3

knn = KNeighborsClassifier(n_neighbors)
knn = knn.fit(train_data, train_labels)
preds = knn.predict(test_data)

accuracy = accuracy_score(test_labels, preds)
print("Accuracy of k-NN with k = {} : {:.2f}%".format(n_neighbors, (accuracy*100)))

Accuracy of k-NN with k = 3 : 93.33%


#### Decision Tree

In [15]:
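from sklearn.tree import DecisionTreeClassifier

# train on the training split only, then score the tree's own predictions on the test split
dtree = DecisionTreeClassifier()
dtree = dtree.fit(train_data, train_labels)
tree_preds = dtree.predict(test_data)
accuracy = accuracy_score(test_labels, tree_preds)
print("Accuracy of decision tree: {:.2f}%".format((accuracy*100)))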

Accuracy of decision tree: 90.00%


#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
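The random forest cell stops after constructing the classifier. A minimal sketch of how it could be finished, alongside two of the other techniques listed earlier (logistic regression and Naive Bayes), is shown here. It repeats the same 80/20 split so it runs on its own; the hyperparameters (`max_iter=1000`, `n_estimators=100`, `random_state=0`) are assumptions made for this illustration and are not tuned.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=25)

# Hyperparameters below are illustrative assumptions, not tuned values.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit every model on the training split and score it on the held-out test split.
scores = {}
for name, model in models.items():
    model.fit(train_data, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(test_data))
    print("Accuracy of {}: {:.2f}%".format(name, scores[name] * 100))
```

With only 30 test rows, each misclassified flower moves the accuracy by more than 3 percentage points, so small differences between models on a single split should not be over-read.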


This dataset is tiny, and what data it has is extremely specific. There aren't many interesting questions to ask of it or much impact to be made with it, but we can at least see that predicting flower type is trivially easy, which is why it is the "Hello World" of machine learning. Moving on!

***

In [1]:
from multiprocessing import Pool

In [None]:
def f(x):
    return x*x

# the context manager shuts the worker pool down when the block exits
with Pool(5) as p:
    print(p.map(f, [1, 2, 3]))

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section2'></a>

# EDA Workout: PUBG (video game data) 

**Dataset:** <a href="https://www.kaggle.com/skihikingkevin/pubg-match-deaths">PUBG Dataset (video game data)</a>

**Main Idea:** in-game match data from a computer game (username, kills, movement data, teammate data)

**Data downloadable?:** Yes

**Data already cleaned?:** Yes

**Data size large? (>1GB):** Yes (about 20GB)

**Overall Difficulty:** &#11088;&#11088;

## Intro to data

There are two folders of data, one called "aggregate" and another called "deaths". Each has 10 GB of data split into 5 CSV files.

In [None]:
# FIRST!... load some data
sample = pd.read_csv("/fake/path/pubg-match-deaths/aggregate/agg_match_stats_0.csv", nrows = 10000)
death_data = pd.read_csv("//Desktop/pubg-match-deaths/deaths/kill_match_stats_final_0.csv", nrows = 10000)
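Reading with `nrows` only ever sees the top of a file. When a whole file needs to be scanned without loading it into memory at once, `pd.read_csv` also accepts `chunksize`, which yields one DataFrame per chunk. The sketch below is a toy illustration: it writes a tiny stand-in CSV to a temporary folder, and the rows in it are invented (only the column names `match_id` and `player_kills` come from this workout).

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for one of the large CSV files; the values are made up.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "toy_match_stats.csv")
pd.DataFrame({
    "match_id": ["a", "a", "b", "b", "b", "c"],
    "player_kills": [0, 3, 1, 0, 2, 5],
}).to_csv(csv_path, index=False)

# Option 1: peek at only the first few rows.
preview = pd.read_csv(csv_path, nrows=4)

# Option 2: stream the file in chunks and combine per-chunk aggregates.
partial_sums = []
for chunk in pd.read_csv(csv_path, chunksize=2):
    partial_sums.append(chunk.groupby("match_id")["player_kills"].sum())

# A match can span several chunks, so sum the partial results per match.
kills_per_match = pd.concat(partial_sums).groupby(level=0).sum()
print(kills_per_match)
```

On the real files a much larger `chunksize` (hundreds of thousands of rows) would be a more sensible starting point; the value 2 here only exists to force several chunks on six rows.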

In [None]:
sample.info()

In [None]:
death_data.info()

In [None]:
sample.describe()

In [None]:
death_data.describe()

In [None]:
sample[:100]

In [None]:
death_data[:100]

In [None]:
# pick a single match and pull its rows from both tables
match_id = "2U4GBNA0YmnLSqvEycnTjo-KT000vfUnhSA2vfVhVPe1QBwCTNTBJ5B_1Ocel6nY"
example_game1_1 = sample[sample["match_id"] == match_id]
example_game1_2 = death_data[death_data["match_id"] == match_id]

In [None]:
print(example_game1_1.columns.values)
print(example_game1_2.columns.values)

In [None]:
example2_edit = example_game1_2.rename(columns = {"victim_name":"player_name"})

In [None]:
example2_edit.columns.values

In [None]:
print(example_game1_1.shape)
print(example_game1_2.shape)

In [None]:
single_match = pd.merge(example_game1_1, example2_edit, how='left', on=['match_id', 'player_name'])

In [None]:
single_match.sort_values("team_placement").head(10)

In [None]:
simple_match = single_match[["player_name","player_kills","player_dist_ride","player_dist_walk","player_dmg","player_survive_time", "team_placement"]]

In [None]:
simple_match.head()