<a href="https://colab.research.google.com/github/tuanky/DS-Unit-2-Applied-Modeling/blob/master/Tuan_Ky_Build_Week_2_project_assignment_applied_modeling_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict? **Crash Descriptor**
- [ ] Is your problem regression or classification? **Classification**
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced? **4 classes, imbalanced: Property Damage Accident is about 60% of rows and Fatal Accident about 0.3%**
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70%? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy? **The majority class, Property Damage Accident, is about 60% of rows, which is within the 50-70% range.**
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
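
The checklist asks whether to use a random split or a time-based split. A minimal sketch of a time-based split follows, assuming the data has a `Year` column as this dataset does. The toy rows, the chosen years, and the helper name `split_by_year` are made up for illustration and are not part of the project code.

```python
import pandas as pd

# Toy stand-in for the crash data; years and labels are invented.
toy = pd.DataFrame({
    "Year": [2012, 2012, 2013, 2013, 2014, 2014, 2015, 2015],
    "Crash Descriptor": ["Property Damage Accident", "Injury Accident"] * 4,
})


def split_by_year(df, val_year, test_year, year_col="Year"):
    """Train on years before val_year, validate on val_year, test on test_year."""
    train = df[df[year_col] < val_year]
    val = df[df[year_col] == val_year]
    test = df[df[year_col] == test_year]
    return train, val, test


train, val, test = split_by_year(toy, val_year=2014, test_year=2015)
print(len(train), len(val), len(test))
```

Splitting this way keeps every validation and test row later in time than the training rows, which is closer to how a deployed model would be used than a shuffled split.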

In [0]:


!pip install dask



In [0]:
!pip install psutil requests
import pandas as pd
import numpy as np
import dask.dataframe as dd



In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
%%time
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    start_mem_gb = start_mem / 1024
    print(f'Memory usage of dataframe is {start_mem:.2f} MB',
          f'/ {start_mem_gb:.2f} GB')
    
    for col in df:
        col_type = str(df[col].dtypes)
        
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            if col_type[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                #if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                #    df[col] = df[col].astype(np.float16)
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    end_mem_gb = end_mem / 1024
    print(f'Memory usage after optimization is: {end_mem:.2f} MB',
          f'/ {end_mem_gb:.2f} GB')
    mem_dec = 100 * (start_mem - end_mem) / start_mem
    print(f'Decreased by {mem_dec:.1f}%')
    
    return df


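def import_data(file):
    """Create a dataframe from a CSV file and optimize its memory usage."""

    # These dtype overrides appear to be unused here: none of the listed
    # columns is present in the crash data shown below.
    dtypes = {
        'AVProductStatesIdentifier': 'float64',
        'AVProductsEnabled': 'float64',
        'AVProductsInstalled': 'float64',
        'GeoNameIdentifier': 'float64',
        'IsProtected': 'float64',
        'PuaMode': 'object'
    }
    # parse_dates=True only attempts to parse the index, so the Date column
    # is read as text (it shows up as category in df.dtypes below).
    ddf = dd.read_csv(file, dtype=dtypes, parse_dates=True, keep_date_col=True)
    df = ddf.compute()  # materialize the dask dataframe as a pandas dataframe
    df = reduce_mem_usage(df)
    return df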

file = 'gdrive/My Drive/train.csv'
print('-' * 80)
print('train')
df = import_data(file)

--------------------------------------------------------------------------------
train
Memory usage of dataframe is 129.87 MB / 0.13 GB
Memory usage after optimization is: 31.46 MB / 0.03 GB
Decreased by 75.8%
CPU times: user 5.24 s, sys: 571 ms, total: 5.81 s
Wall time: 10.3 s
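
The memory reduction above comes from picking the smallest dtype that can hold each column. As an alternative sketch (not the function used in this notebook), pandas can choose the smallest numeric dtype itself through `pd.to_numeric(..., downcast=...)`. The toy dataframe and the helper name `downcast` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names mirror the crash data.
toy = pd.DataFrame({
    "Year": np.array([2014, 2015, 2016], dtype="int64"),
    "Number of Vehicles Involved": np.array([1, 2, 2], dtype="int64"),
    "Police Report": ["Y", "N", "Y"],
})


def downcast(df):
    """Return a copy with numeric columns downcast and text columns as category."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        else:
            out[col] = out[col].astype("category")
    return out


small = downcast(toy)
print(small.dtypes)
```

The `category` conversion saves the most memory when a column has few distinct values relative to its length, which is the case for most of the descriptor columns in this dataset.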


In [0]:
!pip install feather-format



In [0]:
print(df.shape)
df.head()

(895916, 18)


Unnamed: 0,Year,Crash Descriptor,Time,Date,Day of Week,Police Report,Lighting Conditions,Municipality,Collision Type Descriptor,County Name,Road Descriptor,Weather Conditions,Traffic Control Device,Road Surface Conditions,DOT Reference Marker Location,Pedestrian Bicyclist Action,Event Descriptor,Number of Vehicles Involved
0,2014,Injury Accident,5:35,6/18/2014,Wednesday,Y,Dawn,NEW YORK,OTHER,NEW YORK,Straight and Level,Clear,,Dry,,"Crossing, No Signal or Crosswalk","Pedestrian, Collision With",1
1,2014,Property Damage Accident,15:28,11/6/2014,Thursday,Y,Daylight,HENRIETTA,RIGHT ANGLE,MONROE,Straight and Level,Rain,,Wet,,Not Applicable,"Other Motor Vehicle, Collision With",2
2,2014,Property Damage Accident,15:27,3/19/2014,Wednesday,Y,Daylight,CICERO,OVERTAKING,ONONDAGA,Straight and Level,Cloudy,,Dry,,Not Applicable,"Other Motor Vehicle, Collision With",2
3,2014,Property Damage Accident,4:03,6/23/2014,Monday,Y,Dark-Road Unlighted,COLESVILLE,OTHER,BROOME,Straight and Grade,Clear,,Dry,88I91011017,Not Applicable,Deer,1
4,2014,Property Damage Accident,15:28,9/27/2014,Saturday,Y,Daylight,HECTOR,OTHER,SCHUYLER,Straight and Level,Cloudy,,Dry,79 63061019,Not Applicable,Deer,1


In [0]:
df.dtypes

Year                                int16
Crash Descriptor                 category
Time                             category
Date                             category
Day of Week                      category
Police Report                    category
Lighting Conditions              category
Municipality                     category
Collision Type Descriptor        category
County Name                      category
Road Descriptor                  category
Weather Conditions               category
Traffic Control Device           category
Road Surface Conditions          category
DOT Reference Marker Location    category
Pedestrian Bicyclist Action      category
Event Descriptor                 category
Number of Vehicles Involved          int8
dtype: object

In [0]:
# 1. Choose your target: Crash Descriptor
# Data source: https://catalog.data.gov/dataset/motor-vehicle-crashes-case-information-beginning-2009
y = df['Crash Descriptor']
X = df.drop(columns='Crash Descriptor')

In [0]:
y.describe()


count                       895916
unique                           4
top       Property Damage Accident
freq                        538018
Name: Crash Descriptor, dtype: object

In [0]:
y.value_counts(normalize=True)

Property Damage Accident             0.600523
Injury Accident                      0.204620
Property Damage & Injury Accident    0.191531
Fatal Accident                       0.003326
Name: Crash Descriptor, dtype: float64
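
The 50-70% rule of thumb from the assignment can be checked directly from the class frequencies. The sketch below uses made-up toy labels, and the helper names `majority_share` and `accuracy_is_reasonable` are hypothetical, introduced only for this example.

```python
import pandas as pd


def majority_share(y):
    """Return (label, relative frequency) of the most common class."""
    freqs = pd.Series(y).value_counts(normalize=True)
    return freqs.index[0], float(freqs.iloc[0])


def accuracy_is_reasonable(y, low=0.5, high=0.7):
    """Apply the rule of thumb: majority share strictly between low and high."""
    _, share = majority_share(y)
    return low < share < high


# Toy labels: 6 of 10 rows belong to the majority class.
toy_y = (["Property Damage Accident"] * 6
         + ["Injury Accident"] * 3
         + ["Fatal Accident"])

label, share = majority_share(toy_y)
print(label, share, accuracy_is_reasonable(toy_y))
```

Even when the rule passes, a very rare class such as Fatal Accident contributes almost nothing to accuracy, so a per-class metric may still be worth reporting alongside it.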

In [0]:
# 2. Choose what data to hold out for your test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
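
The split above is purely random and produces only train and test sets. With a rare class in the target, one option is to stratify on the labels and carve out a validation set as well. This is a sketch on invented toy data, not the split used in this notebook, and the variable names are chosen to avoid clashing with the notebook's own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 90 "common" rows and 10 "rare" rows.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["common"] * 90 + ["rare"] * 10)

# First hold out 20% for testing, keeping class proportions intact.
X_trval, X_te, y_trval, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42)

# Then split the remainder 75/25 into train and validation (60/20 overall).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_trval, y_trval, test_size=0.25, stratify=y_trval, random_state=42)

print(len(y_tr), len(y_va), len(y_te))
print((y_tr == "rare").sum(), (y_va == "rare").sum(), (y_te == "rare").sum())
```

Stratifying keeps the share of each class the same in every subset, so the rare class is present in validation and test data instead of being left to chance.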

In [0]:
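# 3. Choose an appropriate evaluation metric

# Majority class baseline: predict the most frequent training label for every test row.
# Expect a score near 0.60, the share of Property Damage Accident in the full dataset.
from sklearn.metrics import accuracy_score

majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_test)
accuracy_score(y_test, y_pred)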

In [0]:
!pip install category_encoders
import category_encoders as ce
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier



In [0]:
%%time
pipe = make_pipeline(
    ce.OrdinalEncoder(),
    RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1,
                           max_depth=3)
)
pipe.fit(X_train, y_train)

CPU times: user 49.2 s, sys: 381 ms, total: 49.6 s
Wall time: 14.1 s


In [0]:
print('Train Accuracy:', pipe.score(X_train, y_train))
print('Test Accuracy:', pipe.score(X_test, y_test))

Train Accuracy: 0.6509713533091867
Test Accuracy: 0.6522178319492812
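
A test accuracy of about 0.652 is only a few points above the roughly 0.60 share of the majority class, so accuracy alone says little about how the minority classes are handled. The sketch below shows, on invented toy labels, how balanced accuracy and a per-class report expose a model that only ever predicts the majority class. It does not use the notebook's fitted pipeline.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
)

# Toy ground truth: 6 majority rows, 3 injury rows, 1 fatal row.
y_true = (["Property Damage Accident"] * 6
          + ["Injury Accident"] * 3
          + ["Fatal Accident"])
# A "model" that always predicts the majority class.
y_hat = ["Property Damage Accident"] * 10

acc = accuracy_score(y_true, y_hat)
# Balanced accuracy is the mean of per-class recall: (1 + 0 + 0) / 3.
bal = balanced_accuracy_score(y_true, y_hat)

print(f"accuracy: {acc:.3f}, balanced accuracy: {bal:.3f}")
print(classification_report(y_true, y_hat, zero_division=0))
```

On the real data, the same calls could be applied to `y_test` and `pipe.predict(X_test)` to see recall for each crash type, including the rare Fatal Accident class.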
