Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?
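
For a classification target, one way to answer the guessing question is to always predict the most frequent class and score that. The sketch below assumes accuracy as the metric and uses a small made-up target; the `y_train` values are illustrative and do not come from any dataset.

```python
import pandas as pd

# Hypothetical binary target; replace with your own training labels
y_train = pd.Series([False, False, False, True, False, False, True, False])

# The majority class is the best constant guess under accuracy
majority_class = y_train.mode()[0]

# Accuracy of always guessing the majority class equals its relative frequency
baseline_accuracy = y_train.value_counts(normalize=True).max()

print(majority_class, baseline_accuracy)
```

A first model is only worth keeping if its validation score beats this number. For other metrics the constant guess may differ, so treat this as a starting point rather than a general recipe.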

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep only a subset for the Tribeca neighborhood and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

In [0]:
%%capture
# Work from previous assignment (the %%capture cell magic must be the first line of the cell)
import pandas as pd
import sys
from sklearn.model_selection import train_test_split

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# Read data
url = 'https://raw.githubusercontent.com/doinalangille/DS-Unit-2-Applied-Modeling/master/data/online_shoppers_intention.csv'
df = pd.read_csv(url)

# The dataset contains 125 duplicate rows
# Keep only one unique row
df.drop_duplicates(keep='first', inplace=True)

# Drop outliers: keep rows at or below 99% of each column's maximum
# (65% of the maximum for ProductRelated_Duration)
cols = ['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'PageValues']

for c in cols:
    # ProductRelated_Duration is cut 35% below its maximum; every other column 1% below
    cut_fraction = 0.35 if c == 'ProductRelated_Duration' else 0.01
    col_max = df[c].max()
    df = df[df[c] <= col_max - col_max * cut_fraction]

# Some variables have to be categorical, not numerical
# Transform them into strings
cols = ['SpecialDay', 'OperatingSystems', 'Browser', 'Region', 'TrafficType']
df[cols] = df[cols].astype(str)

# Use a random split: the data covers just one year, so splitting by month
# could lose valuable insights
# Split dataframe into train & test
train, test = train_test_split(df, train_size=0.80, test_size=0.20, 
                              stratify=df['Revenue'], random_state=42)
# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['Revenue'], random_state=42)
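
The outlier loop in the cell above keeps rows whose value is at most a fixed fraction of the column maximum. That is not the same as trimming the top 1% of rows by rank. Here is a minimal sketch of the difference on a made-up column; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical column: 95 small values plus five rows tied at an extreme value
toy = pd.DataFrame({'duration': list(range(1, 96)) + [1000] * 5})

# Rule used in the cell above: keep values at or below max - max * 0.01
col_max = toy['duration'].max()
kept_by_max = toy[toy['duration'] <= col_max - col_max * 0.01]

# Alternative rule: keep values at or below the 99th percentile
quantile_cutoff = toy['duration'].quantile(0.99)
kept_by_quantile = toy[toy['duration'] <= quantile_cutoff]

print(len(kept_by_max), len(kept_by_quantile))
```

On this toy column the max-based rule drops all five extreme rows (5% of the data), while the percentile rule drops none because the 99th percentile equals the tied maximum. Which rule is appropriate depends on the shape of each column, so it is worth checking how many rows each filter actually removes.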

In [2]:
train.shape, val.shape, test.shape

((7433, 18), (1859, 18), (2324, 18))
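
The shapes above follow from applying an 80/20 split twice. This is a quick arithmetic check, assuming the test share is rounded up and the train share is rounded down (an assumption that reproduces the row counts shown). The helper `split_sizes` is hypothetical and only mirrors the arithmetic, not the shuffling or stratification.

```python
import math

def split_sizes(n_rows, test_size=0.20, train_size=0.80):
    """Return (n_train, n_test) for a fractional split of n_rows."""
    n_test = math.ceil(test_size * n_rows)     # assumed: test share rounds up
    n_train = math.floor(train_size * n_rows)  # assumed: train share rounds down
    return n_train, n_test

n_total = 7433 + 1859 + 2324  # rows across train, val, and test
n_trainval, n_test = split_sizes(n_total)
n_train, n_val = split_sizes(n_trainval)

print(n_train, n_val, n_test)
```

The end result is roughly 64% of rows for training, 16% for validation, and 20% for testing.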

### Feature Engineering

In [8]:
train.head()

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10587 | 5 | 59.527778 | 3 | 124.25 | 156 | 5189.960659 | 0.001235 | 0.008929 | 20.509638 | 0.0 | Nov | 2 | 2 | 7 | 1 | Returning_Visitor | False | True |
| 6787 | 8 | 161.668571 | 0 | 0.0 | 518 | 11976.72135 | 3.8e-05 | 0.003837 | 0.0 | 0.0 | Oct | 4 | 2 | 9 | 2 | Returning_Visitor | False | False |
| 4625 | 0 | 0.0 | 0 | 0.0 | 7 | 38.0 | 0.142857 | 0.171429 | 0.0 | 0.0 | May | 2 | 2 | 2 | 13 | Returning_Visitor | False | False |
| 3365 | 5 | 436.0 | 0 | 0.0 | 7 | 26.0 | 0.066667 | 0.075 | 0.0 | 0.0 | May | 3 | 2 | 2 | 13 | Returning_Visitor | False | False |
| 950 | 0 | 0.0 | 0 | 0.0 | 42 | 1185.666667 | 0.0 | 0.004762 | 0.0 | 0.0 | Mar | 2 | 2 | 1 | 1 | Returning_Visitor | False | False |


In [14]:
train['BounceRates'].unique()

array([1.23456800e-03, 3.83000000e-05, 1.42857143e-01, ...,
       6.57894700e-03, 1.13821100e-03, 4.02116400e-03])