<a href="https://colab.research.google.com/github/clairesarraille/airbnb_price_prediction/blob/main/01_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What we did to the data in `data_cleaning_ver04.ipynb`

- Import raw CSV files and concatenate them into one DataFrame:
  - Within the same loop that reads in the CSVs, append `city` and `state` columns based on the file name
- Clean the `city` and `state` columns:
  - Remove `.csv` from the end of the `state` column
  - Replace dashes with spaces in the `city` and `state` columns
  - Check the Los Angeles `id` column for scientific notation
  - Check that all city names came in
  - Check that all state names came in
- Remove line breaks and other special characters
  - Success! There are no line breaks in the text above.
- Explore duplicates:
  - Duplicates on `id`
    - Drop the above duplicate columns
  - Check again after dropping
  - Duplicates on all rows except `id`?
  - Duplicates on `description`
    - Duplicate descriptions: findings
- Duplicates on `listing_url`
  - Check for overall duplicates
  - There are 0 duplicate rows to drop
- Remove irrelevant columns and those that do not add predictive value
- Reset the index before exporting to Git
- Export data with dropped columns to the Git repo
  - Import the CSV stored in GitHub to Google Colab
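
The file-name step in the outline above could look roughly like the sketch below. The `city_state.csv` naming pattern, the underscore separator, and the helper name `city_state_from_filename` are all assumptions for illustration; the actual pattern used in `data_cleaning_ver04.ipynb` is not shown here.

```python
from pathlib import Path

def city_state_from_filename(path):
    """Derive (city, state) from a hypothetical 'city_state.csv' file name."""
    stem = Path(path).stem            # drops the '.csv' suffix
    city, state = stem.split("_", 1)  # assumed underscore separator
    # Replace dashes with spaces, as described in the outline
    return city.replace("-", " "), state.replace("-", " ")

print(city_state_from_filename("raw/los-angeles_california.csv"))
```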

# Import Packages

In [125]:
# Core Packages:
import pandas as pd
import pickle
import numpy as np
from numpy import unique
from numpy import arange
from matplotlib import pyplot

#"""
# Google Colab:
### Print Dataframe like Spreadsheet!
from google.colab import data_table
data_table.enable_dataframe_formatter()
#"""

#"""
# Mount Drive:
from google.colab import drive
drive.mount('/content/drive')
#"""

"""
# Load libraries
import numpy

from pandas import read_csv
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
"""



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



# Import GitHub CSV to Google Colab
- The remaining code must be executed in Google Colab and may require a paid Colab subscription to run.
- The cell below imports the CSV from GitHub into Google Colab.
- When importing, use `low_memory=False`. The pandas documentation describes the `low_memory` parameter as follows:
  - "Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)."
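
As a small illustration of the `dtype` option mentioned in the quote, the sketch below reads a made-up in-memory CSV rather than the real file. The sample data and the choice to pin `id` to a string are assumptions for demonstration only.

```python
import io
import pandas as pd

# Made-up sample: the id column mixes digits and text
csv_text = "id,price\n1,100\n2,150\nabc,200\n"

# low_memory=False parses the file in one pass; dtype removes the guesswork for 'id'
sample = pd.read_csv(io.StringIO(csv_text), low_memory=False, dtype={"id": str})
print(sample.dtypes)
```

Pinning the type of known identifier columns this way avoids relying on inference at all.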

In [126]:
# To get this URL, go to the CSV file on GitHub, choose "Code", and then click the "Raw" link:
# https://github.com/clairesarraille/airbnb_price_prediction/blob/main/airbnb_data.csv
#### IMPORTANT ####
# The link to this raw CSV text contains a token and MUST be regenerated each day to work properly.

url = 'https://media.githubusercontent.com/media/clairesarraille/airbnb_price_prediction/main/airbnb_data.csv?token=ARCBHGIE2BZ6GELDYJTW5K3FKPH5O'
df = pd.read_csv(url, low_memory=False)


HTTPError: ignored

# Preview Data:

In [None]:
df[['price', 'id', 'host_location', 'room_type', 'neighbourhood_cleansed', 'city', 'state']][0:1000]


In [None]:
df.info()

# Double Check for Duplicates:
- We did this in `data_cleaning_ver04.ipynb`, but we'll double-check here.

In [None]:
df.duplicated().any()

In [None]:
len(df)

In [None]:
df[['id']].nunique()

In [None]:
df.loc[df['city'] == 'Los Angeles'].sort_values(by = 'id',ascending=False).head(3)

# Data Cleaning:
- Finding and fixing flaws in a dataset that could have a detrimental effect on a prediction model is known as data cleaning.
- While there are many different kinds of errors that can occur in a dataset, some of the most basic ones are duplicate rows and columns lacking significant information.
## In this section we will:
1. Inspect the number of unique values per column
2. Scrutinize columns with very few values
3. Remove columns with too many missing values
4. Convert columns to the appropriate datatype
5. Code boolean columns as 0/1
6. Split columns into numeric, object, and boolean lists



## Determine Which Columns Have a Single Value (or None) to Remove
- The output below shows that every column has at least 2 unique values except `bathrooms`, which is entirely `NaN` (`nunique()` ignores `NaN` by default, so it reports 0 for that column).

In [None]:
data_table.enable_dataframe_formatter()
# Check all columns for those that have a single value
df.nunique().sort_values(ascending=True)

In [None]:
# bathrooms is the only column without at least 2 values: it is all NaN
df[['bathrooms']].info()


In [None]:
# Drop bathrooms col:
df = df.drop('bathrooms', axis=1)
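
A toy example (not the Airbnb data) of how `nunique()` treats an all-`NaN` column, and one way to list such columns directly. The frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: 'bathrooms' is entirely missing, 'beds' has two distinct values
toy = pd.DataFrame({"bathrooms": [np.nan, np.nan, np.nan], "beds": [1, 2, 2]})

print(toy.nunique())               # NaN is not counted by default
print(toy.nunique(dropna=False))   # NaN counted as a value

# Columns that are entirely missing
all_missing = toy.columns[toy.isna().all()].tolist()
print(all_missing)
```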


## Columns With Very Few Values
- Examine to identify boolean cols, and cols that most likely should be coded as categorical
- We'll measure the relative number of values in each column

#### Relative Percentages of Unique Values:
- We'll look at columns where the percentage is less than 0.5%

In [None]:
# Adapted from the Machine Learning Mastery blog: https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning/
# Print every column whose number of unique values is under 0.5% of the row count
for col in df.columns:
  nunique_vals = df[col].nunique()
  relative = nunique_vals / len(df) * 100
  if relative < 0.5:
    print(f'"{col}" has {nunique_vals} unique values ({relative:.4f}%) -- datatype is: {df[col].dtype}')

### Use above list to identify categorical cols
- The numeric columns listed above need no special encoding: they are continuous rather than categorical or ordinal, and one more unit of any of them isn't inherently "better".
- Categorical columns that will need to be one-hot encoded:
  - city
  - state
  - room_type
  - host_verifications


## Eliminate redundant columns and those with too many missing values

### Look at the number of `NaN` values each column has

In [None]:
pd.DataFrame(df.isna().sum().sort_values(ascending=False))

### Examine the Neighborhood and Host Location Cols with lots of missing values:
- Here are categorical columns describing the neighborhood and location of the listing and the host:
  - `'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed','host_location', 'host_neighbourhood'`
- The Neighbourhood columns look like they may contain redundant data because we already have city and state.
- We'll sample a couple rows for each unique combo of city/state to see the variability in neighbourhood **values**

In [None]:
# The neighborhood columns have very large num of Nulls, so we can safely drop
df[['neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed','host_location', 'host_neighbourhood']].isna().sum().sort_values(ascending=False)

In [None]:
### Toggle this to see full sample dataset or excerpt:
#pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 5)

In [None]:
# Create dataframe of random samples of rows for each unique value in city column
# Adapted from: https://stackoverflow.com/questions/38390242/sampling-one-record-per-unique-value-pandas-python

sample_neigh_df = df[['neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed','host_location','host_neighbourhood', 'city', 'state']].groupby('city').apply(lambda obj_df: obj_df.sample(10, random_state=42))
sample_neigh_df


### We can drop `neighbourhood`:
- `neighbourhood` is either city/state/country or neighborhood/state/country.
- If one of the datasets covered an entire county or MSA, this column could be the specific city within that county or MSA.

### We can drop `neighbourhood_cleansed`:
- There is no consistent definition for this column:
  - Sometimes it's just directional (North, South)
  - Sometimes it's a list of neighborhoods (Cleveland Park, Woodley Park, Massachusetts Avenue Heights)
  - Sometimes it's a supervisor district: "Ward E councilmember James Solomon"
  - For Hawaii, it specifies a neighborhood on any island: "Kapaa-Wailua" (which is on Kauai)

### We can drop `neighbourhood_group_cleansed`:
- This is mostly `NaN` values.

### We can drop `host_location` and `host_neighbourhood`:
- Both have a large number of nulls, and neighborhood names aren't a standard way to compare places.
- They would probably only be meaningful if we had the distance between the host and their listing, which is outside the scope of this project.

In [None]:
# Drop listing neighbourhood columns:
df.drop(columns=['neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'host_location', 'host_neighbourhood'], inplace=True)


In [None]:
df.info()

# Preview All Columns to determine remaining Data Cleaning Tasks:

## Free-text Columns
- These will be TF-IDF vectorized so they can be machine-read
- We'll clean out the "br " and ".br br b" strings, which are leftover artifacts from Excel

In [None]:
# Create a test copy of the free-text columns to strip the stray "br" strings
free_text_cols = ['description', 'neighborhood_overview', 'host_about']
df_free_text = df[free_text_cols].copy()  # .copy() avoids SettingWithCopyWarning on later assignment

In [None]:
# Create random sample of df_free_text to do QA
sample_free_text = df_free_text.apply(lambda df_free_text: df_free_text.sample(10, random_state=42))
sample_free_text

### Strings to remove:
- I searched this output table for pattern "br" and recorded the below strings to remove


In [None]:
br_list = ['bbr', 'br br', 'bThe', 'br ', '.br br b']

In [None]:
# Replace each artifact string in br_list with a space (literal match, not regex)
for col in free_text_cols:
    for string in br_list:
        df_free_text[col] = df_free_text[col].str.replace(string, ' ', regex=False)

### Re-check random sample:
- This looks good. Searching for "br" in the table below turns up no errors except "bRegi", so we'll replace that and check once more with a different `random_state`.

In [None]:
sample_free_text = df_free_text.apply(lambda df_free_text: df_free_text.sample(10, random_state=42))
sample_free_text

In [None]:
for i in free_text_cols:
    df_free_text[i] = df_free_text[i].str.replace('bReg', 'Reg')


In [None]:
sample_free_text = df_free_text.apply(lambda df_free_text: df_free_text.sample(10, random_state=42))
sample_free_text

### Check one last time with different random state:
- Looks perfect: when we search for "br" we only get legitimate words.

In [None]:
sample_free_text = df_free_text.apply(lambda df_free_text: df_free_text.sample(10, random_state=1))
sample_free_text

### Apply string replacement to df_free_text:

In [None]:
# Replace each artifact string in br_list with a space (literal match, not regex)
for col in free_text_cols:
    for string in br_list:
        df_free_text[col] = df_free_text[col].str.replace(string, ' ', regex=False)

In [None]:
data_table.enable_dataframe_formatter()
df.iloc[:,:20].head(3)

In [None]:
sample_free_text = df_free_text.apply(lambda df_free_text: df_free_text.sample(10, random_state=42))
sample_free_text

In [None]:
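# Replace artifact strings with a space; 'bReg' is handled separately so that "Reg" is kept
br_list = ['bbr', 'br br', 'bThe', 'br ', '.br br b']
for col in free_text_cols:
    for string in br_list:
        df[col] = df[col].str.replace(string, ' ', regex=False)
    df[col] = df[col].str.replace('bReg', 'Reg', regex=False)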

### Re-generate our `df_free_text` to examine results:
- Looks Good!

In [None]:
# free_text_cols = ['description', 'neighborhood_overview', 'host_about']
df_free_text = df[free_text_cols]

In [None]:
df_text_random = df_free_text.apply(lambda df_free_text: df_free_text.sample(10, random_state=42))
df_text_random

In [None]:
df.info()

## Numeric Columns:

- First, let's make sure every column that should be numeric is numeric:

### Make our target (`price`) numeric:

In [None]:
# Also remove any commas in larger prices
df['price'] = df['price'].str.replace(',', '')
df['price'] = df['price'].astype(float)
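
If the raw `price` strings also carry a currency symbol, a more defensive conversion might look like the sketch below. The `$` prefix and the sample values are assumptions; the raw format is not shown in this notebook.

```python
import pandas as pd

# Made-up price strings; the '$' prefix is an assumption about the raw format
raw_price = pd.Series(["$1,250.00", "$85.00", None])

# Strip everything except digits and the decimal point, then convert
clean_price = pd.to_numeric(raw_price.str.replace(r"[^0-9.]", "", regex=True), errors="coerce")
print(clean_price.tolist())
```

With `errors="coerce"`, anything that still fails to parse becomes `NaN` instead of raising.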

### Convert `license` column to boolean:
- We assume that any text in the `license` column means the listing has a license, and a missing value (NULL) means it does not.

In [114]:
license_is_null_df = df[['license']].isnull().sort_values(by='license', ascending=True)

In [115]:
license_is_null_df.value_counts()

license
True       183196
False       91762
dtype: int64

In [116]:
df['license'] = (df['license'].notnull()).astype('int')

In [117]:
df['license'].value_counts()

0    183196
1     91762
Name: license, dtype: int64

In [122]:
inspect_license = df[['license']]
inspect_license.apply(lambda inspect_license: inspect_license.sample(20, random_state=42))

Unnamed: 0,license
170792,0
111785,0
...,...
274864,0
101199,1


In [123]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274958 entries, 0 to 274957
Data columns (total 29 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id                      274958 non-null  int64  
 1   name                    274958 non-null  object 
 2   description             272232 non-null  object 
 3   neighborhood_overview   167804 non-null  object 
 4   host_about              164351 non-null  object 
 5   host_verifications      274890 non-null  object 
 6   host_has_profile_pic    274943 non-null  object 
 7   host_identity_verified  274943 non-null  object 
 8   latitude                274958 non-null  float64
 9   longitude               274958 non-null  float64
 10  property_type           274958 non-null  object 
 11  room_type               274958 non-null  object 
 12  accommodates            274958 non-null  int64  
 13  bathrooms_text          274616 non-null  object 
 14  bedrooms            

### `instant_bookable`

In [124]:
df[['instant_bookable']].value_counts()

instant_bookable
f                   185563
t                    89395
dtype: int64

In [None]:
# Convert t/f values to 0/1
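
One possible way to do this conversion is sketched below on a toy column. The nullable `Int64` dtype is an assumption chosen so that missing values survive as `<NA>`; a plain `int` cast would work if the column has no missing values.

```python
import pandas as pd

# Toy column using the same 't'/'f' coding as `instant_bookable`
toy = pd.DataFrame({"instant_bookable": ["t", "f", "f", None]})

# Map to 1/0; values outside the mapping (including missing) become NaN,
# so a nullable integer dtype keeps them as <NA> rather than forcing floats
toy["instant_bookable"] = toy["instant_bookable"].map({"t": 1, "f": 0}).astype("Int64")
print(toy["instant_bookable"].tolist())
```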

## Boolean Types - convert and clean

### Columns with 2 values are boolean

In [None]:
# Store Boolean Columns in List:
boolean_cols = []
for col in range(df.shape[1]):
  nunique_vals = df.iloc[:,col].nunique()
  relative = (float(nunique_vals) / df.shape[0]) * 100
  if nunique_vals == 2:
    boolean_cols.append(df.columns[col])
    print(f'"{df.columns[col]}" has {nunique_vals} unique values ({relative:.4f}%)')
print()
print(boolean_cols)

### Identify cols to encode as Categorical by removing numeric and boolean:

In [None]:
# Look at cols having percentage < .5, datatype is object, and NOT a boolean col:
categorical_cols = []
for col in range(df.shape[1]):
  nunique_vals = df.iloc[:,col].nunique()
  relative = (float(nunique_vals) / df.shape[0]) * 100
  if relative < .5 and df[df.columns[col]].dtypes == 'object' and df.columns[col] not in boolean_cols:
    categorical_cols.append(df.columns[col])
    print(f'"{df.columns[col]}" has {nunique_vals} unique values ({relative:.4f}%) -- datatype is: {df[df.columns[col]].dtypes}')
print()
for col in categorical_cols:
  print(col)

#### Likely Encode as Categorical:
- host_verifications
- neighbourhood (already dropped above)
- neighbourhood_group_cleansed (already dropped above)
- property_type
- room_type
- bathrooms_text
- city
- state

#### Other Considerations:
- neighbourhood had over 1,000 unique values, which could lead to overfitting if coded as categorical; it was dropped above.

### Recall the lists holding our boolean columns and the columns we may encode as categorical. We'll use these later in our pipeline.

In [None]:
print(boolean_cols)
print(categorical_cols)

In [None]:
bool_df = df[boolean_cols]
cat_low_nunique_df = df[categorical_cols]

In [None]:
bool_df.head(3)

### Boolean fields that have `NaN` values:
- `host_has_profile_pic`
- `host_identity_verified`

In [None]:
bool_df.info()

### Inspect the categorical variables with low unique value counts:

In [None]:
cat_low_nunique_df.head(3)

## Split into numeric and text data to further clean/munge

### Non-Numeric cols minus Boolean Cols:

In [None]:
%unload_ext google.colab.data_table

In [None]:
# Create list of non-numeric cols minus the boolean columns:
obj_cols = []
for col in range(df.shape[1]):
  if df[df.columns[col]].dtypes == 'object' and df.columns[col] not in boolean_cols:
    obj_cols.append(df.columns[col])
print()
for col in obj_cols:
  print(col)
obj_df = df[obj_cols]
print()
data_table.disable_dataframe_formatter()
obj_df.iloc[0:5,:]

In [None]:
# Create list of non-numeric cols minus the boolean columns:
obj_cols_2 = []
for col in range(df.shape[1]):
  if df[df.columns[col]].dtypes == 'object' and df.columns[col] not in boolean_cols:
    obj_cols_2.append(df.columns[col])
print()
for col in obj_cols_2:
  print(col)
obj_df_2 = df[obj_cols_2]
print()
data_table.disable_dataframe_formatter()
obj_df_2.iloc[0:10,:].head()


## Numeric Columns:

In [None]:
# Create list of numeric cols:
num_cols = []
for col in range(df.shape[1]):
  if df[df.columns[col]].dtypes != 'object' and df.columns[col] not in boolean_cols:
    num_cols.append(df.columns[col])
print()
for col in num_cols:
  print(col)
num_df = df[num_cols]
print()
data_table.disable_dataframe_formatter()
num_df.iloc[0:10,:].head()
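
The same split can also be written with `select_dtypes`, sketched here on a toy frame. The column values and the `toy_boolean_cols` stand-in for `boolean_cols` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room"],
    "instant_bookable": ["t", "f"],
    "accommodates": [4, 2],
    "latitude": [34.05, 40.71],
})
toy_boolean_cols = ["instant_bookable"]  # stand-in for `boolean_cols`

# Numeric columns by dtype; everything else that is not boolean counts as text
toy_num_cols = toy.select_dtypes(include="number").columns.difference(toy_boolean_cols).tolist()
toy_obj_cols = toy.select_dtypes(exclude="number").columns.difference(toy_boolean_cols).tolist()
print(toy_num_cols, toy_obj_cols)
```

Note that `Index.difference` returns the names sorted alphabetically, so the original column order is not preserved.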

# NULLS and Outliers (We'll use Pipeline to impute)
- Numeric columns with NULLs include `bedrooms` and `beds`.
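
A minimal sketch of imputing inside a scikit-learn `Pipeline`, using toy data. Median imputation and the linear model are assumptions here; the notebook does not yet commit to a strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data with missing bedrooms/beds values
X_toy = pd.DataFrame({"bedrooms": [1, np.nan, 3, 2], "beds": [1, 2, np.nan, 2]})
y_toy = pd.Series([100.0, 150.0, 300.0, 200.0])

# Median imputation is an assumption; the notebook does not name a strategy
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LinearRegression()),
])
pipe.fit(X_toy, y_toy)

imputed = pipe.named_steps["impute"].transform(X_toy)
print(imputed)
```

Keeping the imputer inside the pipeline means its statistics are learned only from the data passed to `fit`, which matters once cross-validation is added.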

In [None]:
num_df.info()

In [7]:
data_table.enable_dataframe_formatter()
from pandas import DataFrame  # public import path
DataFrame(df.bedrooms.value_counts()).sort_index()

Unnamed: 0,bedrooms
1.0,88301
2.0,59690
3.0,33143
4.0,15687
5.0,5151
6.0,1926
7.0,600
8.0,453
9.0,160
10.0,93


In [None]:

DataFrame(df.beds.value_counts()).sort_index()

In [None]:
## Remove Outlier that has 132 beds
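
One way this removal could be done is sketched below on toy data. Treating 132 as a cutoff and keeping rows with missing `beds` (so they can be imputed later) are both assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"beds": [1, 2, 132, np.nan], "price": [90.0, 120.0, 5000.0, 75.0]})

# Keep rows below the cutoff, plus rows where beds is missing (left for imputation).
# The value 132 comes from the note above; using it as a threshold is an assumption.
mask = toy["beds"].isna() | (toy["beds"] < 132)
toy_no_outlier = toy[mask].reset_index(drop=True)
print(len(toy), "->", len(toy_no_outlier))
```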

### Create second version of obj_col_df:

- Cut down `amenities` according to article
- Slice numbers from `bathrooms_text`
- One-Hot Encode:
  - `host_verifications`
  - `property_type`
  - `room_type`
  - `amenities`
  - `city`
  - `state`
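
Two of the steps listed above, sketched on toy data: pulling the number out of `bathrooms_text` and one-hot encoding a categorical column with `pd.get_dummies`. The sample strings and the new column name `bathrooms_num` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "bathrooms_text": ["1 bath", "2.5 baths", "1 shared bath", "Half-bath"],
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
})

# Pull the first number out of the text; rows without digits (e.g. "Half-bath") become NaN
toy["bathrooms_num"] = toy["bathrooms_text"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

# One-hot encode a categorical column
encoded = pd.get_dummies(toy, columns=["room_type"], prefix="room_type")
print(encoded.columns.tolist())
```

Text-only values such as "Half-bath" would need their own rule if they occur in the real data.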

In [None]:
boolean_cols

# Pickle Cleaned Data:


In [None]:
# df.to_pickle('clean_data.pkl')

# df = pd.read_pickle("/content/drive/MyDrive/capstone_data/clean_data.pkl")

# ROUGH baseline model

In [None]:
df.info()

In [None]:
# Candidate features for the baseline (from df.info() above):
# latitude                      274958 non-null  float64
# longitude                     274958 non-null  float64
# accommodates
# minimum_nights                274958 non-null  int64
# maximum_nights                274958 non-null  int64



In [None]:
baseline_cols = ['latitude', 'longitude', 'accommodates', 'minimum_nights', 'maximum_nights']
X = df[baseline_cols]

In [None]:
df['price']

In [None]:
y = df.price

In [None]:
# The sklearn imports above are inside a disabled string block, so import here
from sklearn import linear_model
baseline_model = linear_model.LinearRegression()
baseline_model.fit(X, y)

In [None]:
print(baseline_model.intercept_, baseline_model.coef_, baseline_model.score(X, y))

## Model R-squared value (on the training data) is 3.33%
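
Because the score above is computed on the same rows the model was fit on, a held-out estimate would be a natural next step. The sketch below shows the mechanics on synthetic data (not the Airbnb data); the 80/20 split and all names ending in `_demo` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (not the Airbnb data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
demo_model = LinearRegression().fit(X_train, y_train)

print("train R^2:", demo_model.score(X_train, y_train))
print("test R^2:", demo_model.score(X_test, y_test))
```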