## Python for Data Analysis
#### An Analysis of AirBnB Listings
This section is intended to introduce you to data analysis concepts in Python through an interactive exercise. The purpose of this exercise is to learn some basic data analysis and visualization techniques in Python while exploring the AirBnB listings data set for trends and other interesting insights. 

Research question: What factors affect review scores for AirBnB listings?

In [1]:
# import libraries
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import datetime 

In [2]:
# Get current working directory
os.getcwd()

'C:\\Users\\rrasheed\\Desktop\\Python 101'

In [5]:
# It is good practice to set your working directory explicitly:
# the folder that Python reads data from and writes data to.

# You will need to change this path to your own working directory.
# Prefix the string with r to make it a raw string literal; otherwise Python
# treats backslashes as the start of escape sequences (for example, \n or \t).

path = r'C:\Users\rrasheed\Desktop\Python 101'
os.chdir(path)
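
To see why the `r` prefix matters, here is a minimal sketch using a made-up Windows path (the folder names are hypothetical and chosen only because they start with `n` and `t`):

```python
# Hypothetical path, used only to illustrate escape sequences
plain_path = 'C:\new_data\tables'   # \n becomes a newline, \t becomes a tab
raw_path = r'C:\new_data\tables'    # backslashes are kept exactly as typed

print(repr(plain_path))
print(repr(raw_path))
print(len(plain_path), len(raw_path))
```

The plain string is two characters shorter because each escape sequence collapses into a single character, so it no longer names a real folder.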

In [6]:
# Read csv into a pandas dataframe
df = pd.read_csv('airbnb.csv')

# Notice the warning about mixed data types in the output below.
df.head()

  interactivity=interactivity, compiler=compiler, result=result)


| | accommodates | bathrooms | bed_type | bedrooms | beds | cancellation_policy | city | cleaning_fee | first_review | host_has_profile_pic | ... | instant_bookable | last_review | latitude | longitude | neighbourhood | number_of_reviews | property_type | review_scores_rating | room_type | zipcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | Real Bed | 1 | 1 | flexible | SF | True | 8/15/2016 | t | ... | t | 1/18/2017 | 37.773742 | -122.391503 | Mission Bay | 5 | Apartment | 95 | Private room | 95202\r\r\r\r\r\r\r |
| 1 | 1 | 1.0 | Real Bed | 1 | 1 | flexible | LA | False | NaN | t | ... | f | NaN | 34.13568 | -118.400691 | Studio City | 6 | House | 100 | Private room | 91604-3646 |
| 2 | 2 | 1.0 | Real Bed | 1 | 1 | moderate | LA | True | 11/16/2011 | t | ... | f | 4/8/2017 | 34.192617 | -118.136794 | Altadena | 136 | Apartment | 96 | Entire home/apt | 91001-2243 |
| 3 | 2 | 1.0 | Real Bed | 1 | 1 | flexible | LA | False | NaN | t | ... | f | NaN | 34.023618 | -118.501174 | Santa Monica | 0 | Apartment | 96 | Private room | 90403-2638 |
| 4 | 2 | 1.0 | Real Bed | 1 | 1 | moderate | LA | True | 9/6/2010 | t | ... | f | 2/28/2017 | 34.104213 | -118.26004 | Silver Lake | 16 | House | 99 | Private room | 90039-2715 |


In [7]:
df.columns[22]

'zipcode'

In [8]:
# Read in your data set but this time specify the data type
df = pd.read_csv('airbnb.csv', dtype={'zipcode':str})
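
The effect of `dtype` is easiest to see on a tiny example. The sketch below reads a two-row, made-up CSV from memory (the `listing_id` column and its values are hypothetical) with and without the `dtype` argument:

```python
import io
import pandas as pd

# Made-up data: both zip codes start with a zero
csv_text = "listing_id,zipcode\n1,02108\n2,02026\n"

# Without dtype, pandas infers integers and the leading zeros are lost
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the column is kept as text exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'zipcode': str})

print(inferred['zipcode'].tolist())
print(as_text['zipcode'].tolist())
```

Reading zip codes as strings keeps leading zeros and also gives every value in the column the same type, which is what removes the mixed-type warning.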

In [9]:
# Reference/slice your data using numeric indexing
# Indexing starts at 0 rather than 1
df.iloc[-10:,[22]]

| | zipcode |
|---|---|
| 99559 | 2108.0 |
| 99560 | 2108.0 |
| 99561 | 2108.0 |
| 99562 | 2108.0 |
| 99563 | 2108.0 |
| 99564 | 2108.0 |
| 99565 | 2108.0 |
| 99566 | 2026.0 |
| 99567 | 210.0 |
| 99568 | NaN |
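
If the `iloc` syntax is new to you, the following sketch on a small made-up DataFrame (all values are hypothetical) shows how the row and column selectors behave:

```python
import pandas as pd

# Small stand-in for the listings data
toy = pd.DataFrame({
    'city': ['SF', 'LA', 'LA', 'NYC'],
    'beds': [1, 1, 2, 3],
    'zipcode': ['95202', '91604', '91001', '10001'],
})

first_cell = toy.iloc[0, 0]      # row 0, column 0: a single value
last_two = toy.iloc[-2:, [2]]    # last two rows, column 2 in a list: a DataFrame
as_series = toy.iloc[-2:, 2]     # same rows, column 2 as a bare integer: a Series

print(first_cell)
print(last_two)
print(type(as_series).__name__)
```

Wrapping the column position in a list, as in `df.iloc[-10:,[22]]`, is what keeps the result a DataFrame rather than a Series.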


In [None]:
# Let's find out the dimensions of our data set
print(df.shape, df.size)

In [None]:
# Output first 5 rows of the data
df.head()

In [None]:
df.info()

## Data Cleaning

In [None]:
# Identify columns and their data types
df.dtypes

In [None]:
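# Look up the position of the zipcode column
df.columns.get_loc("zipcode")
# Preview zip codes for LA apartments; .loc with a boolean mask avoids chained indexing
df.loc[(df['city']=='LA') & (df['property_type']=='Apartment'), 'zipcode'].head()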

In [None]:
# Clean zip code data

# Extract only the first 5 characters 
df['zipcode'] = df['zipcode'].str[:5]

# left pad string with zeros up to 5 characters
df['zipcode'] = df['zipcode'].str.zfill(5)
df['zipcode'].head(n=5)
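
The `iloc` output earlier shows values such as `2108.0`, which suggests some zip codes in the file carry a trailing `.0`. Slicing the first five characters of such a value leaves `2108.`, so a slightly more defensive approach may be worth considering. The sketch below is one possible variant, not the method used in this notebook; it runs on a handful of sample strings modeled on the outputs above:

```python
import pandas as pd

# Sample values modeled on the outputs shown earlier
raw = pd.Series(['95202\r\r\r', '91604-3646', '2108.0', '210.0', None])

# Approach used above: slice, then pad
sliced = raw.str[:5].str.zfill(5)

# Variant: keep only the leading run of digits, then slice and pad
digits = raw.str.extract(r'^(\d+)', expand=False)
clean = digits.str[:5].str.zfill(5)

print(sliced.tolist())
print(clean.tolist())
```

Whether a three-digit value such as `210` should be padded or treated as invalid is a judgment call that depends on the data, so check a sample of the cleaned column before relying on it.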

In [None]:
# Convert date fields to datetime
df['first_review'] = pd.to_datetime(df['first_review'])
df['host_since'] = pd.to_datetime(df['host_since'])
df['last_review'] = pd.to_datetime(df['last_review'])

Exercise: write a for loop that will convert the date fields to datetime

In [None]:
def convert_date(field_name):
    df[field_name] = pd.to_datetime(df[field_name])
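
One possible way to approach the exercise is sketched below on a tiny made-up DataFrame (the `listings` name and the date values are hypothetical, so the block runs on its own):

```python
import pandas as pd

# Tiny stand-in for the listings data
listings = pd.DataFrame({
    'first_review': ['8/15/2016', None],
    'host_since': ['3/26/2012', '6/18/2017'],
    'last_review': ['1/18/2017', None],
})

# Loop over the column names and convert each one in turn
date_fields = ['first_review', 'host_since', 'last_review']
for field in date_fields:
    listings[field] = pd.to_datetime(listings[field])

print(listings.dtypes)
```

Missing dates become `NaT`, the datetime equivalent of `NaN`.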

In [None]:
# Convert binary variables to 0 and 1 for consistency
df['cleaning_fee'] = np.where(df['cleaning_fee']==True, 1, 0)
df['host_has_profile_pic'] = np.where(df['host_has_profile_pic']=='t',1,0)
df['host_identity_verified'] = np.where(df['host_identity_verified']=='t',1,0)
df['instant_bookable'] = np.where(df['instant_bookable']=='t',1,0)
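
One thing to keep in mind with `np.where` is that anything not equal to `'t'` becomes 0, including missing values. The sketch below compares it with `Series.map` on three made-up flags; `map` is shown only as an alternative, not as what this notebook uses:

```python
import numpy as np
import pandas as pd

# Made-up flags, including one missing value
flags = pd.Series(['t', 'f', None])

# np.where: the missing value fails the comparison and is coded 0
with_where = np.where(flags == 't', 1, 0)

# map: values absent from the dictionary stay missing (NaN)
with_map = flags.map({'t': 1, 'f': 0})

print(with_where.tolist())
print(with_map.tolist())
```

Which behavior you want depends on whether a missing flag should count as "no" or stay unknown.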

## Feature Engineering

Feature engineering is the process of transforming existing variables or creating new ones in order to improve model performance.

In [None]:
df['property_type'].value_counts()

In [None]:
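# Collapse property types into three groups: Apartment, House, and Other
df['property_type_feature'] = df['property_type']

# Assign through .loc with a boolean mask; chained indexing (df[col][mask] = ...)
# can write to a temporary copy and triggers SettingWithCopyWarning
is_other = ~df['property_type_feature'].isin(['Apartment', 'House'])
df.loc[is_other, 'property_type_feature'] = 'Other'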

In [None]:
df['property_type_feature'].value_counts()

In [None]:
# Segmenting
med_score = df['review_scores_rating'].median()
med_score

In [None]:
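# Flag listings rated below the median (despite the column name, the cutoff is the median, not the mean).
# Listings with a missing rating fail the comparison and are therefore coded 0.
df['below_average'] = np.where(df['review_scores_rating'] < med_score, 1, 0)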
df.head()

In [None]:
categorical_feats = ['bed_type', 'cancellation_policy', 'city', 'property_type_feature', 'room_type']
df_dummy = pd.get_dummies(df, columns=categorical_feats, drop_first=True)
df_dummy.head()
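
To see what `get_dummies` with `drop_first=True` produces, here is a minimal sketch on a made-up three-row DataFrame (the values are hypothetical):

```python
import pandas as pd

# Made-up data with one categorical and one numeric column
toy = pd.DataFrame({'city': ['LA', 'SF', 'NYC'], 'beds': [1, 2, 3]})

# Each category becomes its own indicator column; drop_first removes the
# first category (alphabetically), which then serves as the baseline
toy_dummy = pd.get_dummies(toy, columns=['city'], drop_first=True)

print(toy_dummy.columns.tolist())
print(toy_dummy)
```

Here `LA` is dropped, so a row with both `city_NYC` and `city_SF` equal to 0 (or False) is an LA listing. Dropping one level avoids perfectly collinear columns in models such as linear regression.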

In [None]:
# Lambda function coupled with apply on a pandas series
df = df.sort_values(by='first_review')
df['month'] = df['first_review'].apply(lambda x: x.month)
df

In [None]:
# This named function does the same thing as the lambda function above:
# it takes an input parameter called "date" and returns the month of that date.
def retrieveMonth(date):
    return date.month

# It can be passed to apply in place of the lambda
df['month'] = df['first_review'].apply(retrieveMonth)
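
For datetime columns, pandas also offers the `.dt` accessor, which extracts the month without calling a Python function on each row. The sketch below compares the two approaches on three made-up dates:

```python
import pandas as pd

# Made-up review dates, including one missing value
first_review = pd.to_datetime(pd.Series(['8/15/2016', None, '11/16/2011']))

# Row-by-row with apply, as in the cells above
month_apply = first_review.apply(lambda x: x.month)

# Vectorized alternative using the .dt accessor
month_dt = first_review.dt.month

print(month_apply.tolist())
print(month_dt.tolist())
```

Both versions return a missing value for the missing date; on large data sets the `.dt` version is usually the faster of the two.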