Finding Missing Values

https://medium.com/analytics-vidhya/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd

Handling Missing Values in a Data Frame


https://medium.com/analytics-vidhya/python-handling-missing-values-in-a-data-frame-4156dac4399

In [16]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

## Seattle Airbnb Open Data

https://www.kaggle.com/datasets/airbnb/seattle?resource=download&select=reviews.csv

####  Context
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in Seattle, WA.

#### Content
The following Airbnb activity is included in this Seattle dataset:

- Listings, including full descriptions and average review score
- Reviews, including unique id for each reviewer and detailed comments
- Calendar, including listing id and the price and availability for that day

#### Inspiration
- Can you describe the vibe of each Seattle neighborhood using listing descriptions?
- What are the busiest times of the year to visit Seattle? By how much do prices spike?
- Is there a general upward trend of both new Airbnb listings and total Airbnb visitors to Seattle?

http://insideairbnb.com/get-the-data/

# Loading data 

In [17]:
df_c = pd.read_csv("calendar.csv")
df_l = pd.read_csv("listings.csv")
df_r = pd.read_csv("reviews.csv")

In [31]:
df_l['reviews_per_month'].head(3)

0    4.07
1    1.48
2    1.15
Name: reviews_per_month, dtype: float64

In [4]:
# only the last expression in a cell is rendered, so the table below is df_r.describe()
df_c.describe()
df_l.describe()
df_r.describe()

Unnamed: 0,listing_id,id,reviewer_id
count,84849.0,84849.0,84849.0
mean,3005067.0,30587650.0,17013010.0
std,2472877.0,16366130.0,13537040.0
min,4291.0,3721.0,15.0
25%,794633.0,17251270.0,5053141.0
50%,2488228.0,32288090.0,14134760.0
75%,4694479.0,44576480.0,27624020.0
max,10248140.0,58736510.0,52812740.0


# Finding Missing Data

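Counting rows by null pattern for each dataframe. 'isnull().value_counts()' groups rows by which columns are null, so each output line is one combination of null flags and the number of rows that share it.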

In [5]:
display(df_c.isnull().value_counts())
display(df_l.isnull().value_counts())
display(df_r.isnull().value_counts())

listing_id  date   available  price
False       False  False      False    934542
                              True     459028
dtype: int64

id     listing_url  scrape_id  last_scraped  name   summary  space  description  experiences_offered  neighborhood_overview  notes  transit  thumbnail_url  medium_url  picture_url  xl_picture_url  host_id  host_url  host_name  host_since  host_location  host_about  host_response_time  host_response_rate  host_acceptance_rate  host_is_superhost  host_thumbnail_url  host_picture_url  host_neighbourhood  host_listings_count  host_total_listings_count  host_verifications  host_has_profile_pic  host_identity_verified  street  neighbourhood  neighbourhood_cleansed  neighbourhood_group_cleansed  city   state  zipcode  market  smart_location  country_code  country  latitude  longitude  is_location_exact  property_type  room_type  accommodates  bathrooms  bedrooms  beds   bed_type  amenities  square_feet  price  weekly_price  monthly_price  security_deposit  cleaning_fee  guests_included  extra_people  minimum_nights  maximum_nights  calendar_updated  has_availability  availability_30  availabi

listing_id  id     date   reviewer_id  reviewer_name  comments
False       False  False  False        False          False       84831
                                                      True           18
dtype: int64
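
The difference between the row-pattern view above and a plain per-column null count can be seen on a small frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration and only stand in for the calendar data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the calendar data (not taken from the real files).
toy = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "price": ["$85.00", np.nan, np.nan, "$120.00"],
})

# Row-pattern view: how many rows share each combination of null flags.
pattern_counts = toy.isnull().value_counts()

# Per-column view: how many nulls each column holds.
null_per_column = toy.isnull().sum()

print(pattern_counts)
print(null_per_column)
```

The pattern view is handy for spotting columns that go missing together, while the per-column count is easier to read on wide frames such as the listings data.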

Counting datatypes for each dataframe

In [6]:
df_c.dtypes.value_counts()
df_l.dtypes.value_counts()
df_r.dtypes.value_counts()

int64     3
object    3
dtype: int64

Separate categorical and numerical columns

In [7]:
num_vars_c = df_c.columns[df_c.dtypes != 'object']
cat_vars_c = df_c.columns[df_c.dtypes == 'object']

num_vars_l = df_l.columns[df_l.dtypes != 'object']
cat_vars_l = df_l.columns[df_l.dtypes == 'object']

num_vars_r = df_r.columns[df_r.dtypes != 'object']
cat_vars_r = df_r.columns[df_r.dtypes == 'object']
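
An alternative way to make the same split is `select_dtypes`, which avoids comparing dtypes to the string 'object'. This is a sketch on a made-up frame (`toy` and its values are assumptions, not the real listings data).

```python
import pandas as pd

# Made-up frame with one integer, one text and one float column.
toy = pd.DataFrame({
    "listing_id": [1, 2],
    "price": ["$85.00", "$120.00"],
    "bathrooms": [1.0, 2.5],
})

# Numeric columns (ints and floats) versus everything else.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

print(list(num_cols))
print(list(cat_cols))
```

Note that a price stored as text such as "$85.00" lands in the non-numeric group, which matches the calendar output where 'price' is listed among the object columns.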

In [8]:
print(num_vars_c)
print(cat_vars_c)

df_c[num_vars_c]
df_l[num_vars_l]

Index(['listing_id'], dtype='object')
Index(['date', 'available', 'price'], dtype='object')


Unnamed: 0,id,scrape_id,host_id,host_listings_count,host_total_listings_count,latitude,longitude,accommodates,bathrooms,bedrooms,...,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,calculated_host_listings_count,reviews_per_month
0,241032,20160104002432,956883,3.0,3.0,47.636289,-122.371025,4,1.0,1.0,...,95.0,10.0,10.0,10.0,10.0,9.0,10.0,,2,4.07
1,953595,20160104002432,5177328,6.0,6.0,47.639123,-122.365666,4,1.0,1.0,...,96.0,10.0,10.0,10.0,10.0,10.0,10.0,,6,1.48
2,3308979,20160104002432,16708587,2.0,2.0,47.629724,-122.369483,11,4.5,5.0,...,97.0,10.0,10.0,10.0,10.0,10.0,10.0,,2,1.15
3,7421966,20160104002432,9851441,1.0,1.0,47.638473,-122.369279,3,1.0,0.0,...,,,,,,,,,1,
4,278830,20160104002432,1452570,2.0,2.0,47.632918,-122.372471,6,2.0,3.0,...,92.0,9.0,9.0,10.0,10.0,9.0,9.0,,1,0.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3813,8101950,20160104002432,31148752,354.0,354.0,47.664295,-122.359170,6,2.0,3.0,...,80.0,8.0,10.0,4.0,8.0,10.0,8.0,,8,0.30
3814,8902327,20160104002432,46566046,1.0,1.0,47.649552,-122.318309,4,1.0,1.0,...,100.0,10.0,10.0,10.0,10.0,10.0,10.0,,1,2.00
3815,10267360,20160104002432,52791370,1.0,1.0,47.508453,-122.240607,2,1.0,1.0,...,,,,,,,,,1,
3816,9604740,20160104002432,25522052,1.0,1.0,47.632335,-122.275530,2,1.0,0.0,...,,,,,,,,,1,


Find which fields have missing values ('isnull'), count them ('sum()'), and sort the counts ('sort_values()') so the columns with the most nulls come first.\
This looks specifically at the numerical columns ('num_vars').

In [9]:
df_c[num_vars_c].isnull().sum().sort_values(ascending=False)
df_l[num_vars_l].isnull().sum().sort_values(ascending=False)

license                           3818
square_feet                       3721
review_scores_accuracy             658
review_scores_checkin              658
review_scores_value                656
review_scores_location             655
review_scores_cleanliness          653
review_scores_communication        651
review_scores_rating               647
reviews_per_month                  627
bathrooms                           16
bedrooms                             6
host_total_listings_count            2
host_listings_count                  2
beds                                 1
availability_365                     0
calculated_host_listings_count       0
number_of_reviews                    0
id                                   0
availability_90                      0
availability_60                      0
scrape_id                            0
maximum_nights                       0
minimum_nights                       0
guests_included                      0
accommodates             

Shows what percentage of data is missing per column

In [10]:
round(((df_l[num_vars_l].isnull().sum().sort_values(ascending=False)/len(df_l)) * 100), 1)

license                           100.0
square_feet                        97.5
review_scores_accuracy             17.2
review_scores_checkin              17.2
review_scores_value                17.2
review_scores_location             17.2
review_scores_cleanliness          17.1
review_scores_communication        17.1
review_scores_rating               16.9
reviews_per_month                  16.4
bathrooms                           0.4
bedrooms                            0.2
host_total_listings_count           0.1
host_listings_count                 0.1
beds                                0.0
availability_365                    0.0
calculated_host_listings_count      0.0
number_of_reviews                   0.0
id                                  0.0
availability_90                     0.0
availability_60                     0.0
scrape_id                           0.0
maximum_nights                      0.0
minimum_nights                      0.0
guests_included                     0.0
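
The same percentage can also be computed with `isnull().mean()`, since the mean of a boolean column is the share of True values. A minimal sketch on a made-up frame (the column names `a`, `b`, `c` are placeholders):

```python
import numpy as np
import pandas as pd

# Made-up frame: 'a' is half missing, 'b' is complete, 'c' is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan] * 4,
})

# Share of nulls per column, as a percentage rounded to one decimal place.
pct_missing = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)

print(pct_missing)
```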


# Cleaning Missing Data

Deleting columns/rows with missing data

In [11]:
df_l = df_l.dropna(subset=['host_name'], how='any', axis=0) # delete rows with na in 'host_name' column
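
The cell above removes rows. Columns can be removed in a similar way, for example columns that are mostly or entirely empty such as 'license' and 'square_feet'. The sketch below uses a made-up frame, and the 0.5 cut-off (`max_missing_share`) is a hypothetical choice, not a recommendation from the source.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the missing-value shares seen in the listings data.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "license": [np.nan] * 4,                       # 100% missing
    "square_feet": [np.nan, np.nan, np.nan, 800.0],  # 75% missing
    "bathrooms": [1.0, np.nan, 2.0, 1.5],            # 25% missing
})

# Drop only the columns that are entirely empty.
no_empty = toy.dropna(axis=1, how="all")

# Drop columns whose share of missing values exceeds a chosen cut-off.
max_missing_share = 0.5  # hypothetical threshold
missing_share = toy.isnull().mean()
trimmed = toy.loc[:, missing_share <= max_missing_share]

print(list(no_empty.columns))
print(list(trimmed.columns))
```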

Imputation for Numerical values

In [12]:
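# function to fill missing values with the column mean for numerical cols
# (median() works the same way; mode() returns a Series, so use col.mode().iloc[0];
# an all-null column such as 'license' has no mean and stays NaN)
fill_mean = lambda col: col.fillna(col.mean())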

# apply function to fill the missing values
df_l[num_vars_l] = df_l[num_vars_l].apply(fill_mean)
df_l[num_vars_l]

Unnamed: 0,id,scrape_id,host_id,host_listings_count,host_total_listings_count,latitude,longitude,accommodates,bathrooms,bedrooms,...,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,calculated_host_listings_count,reviews_per_month
0,241032,20160104002432,956883,3.0,3.0,47.636289,-122.371025,4,1.0,1.0,...,95.000000,10.000000,10.000000,10.000000,10.000000,9.000000,10.000000,,2,4.070000
1,953595,20160104002432,5177328,6.0,6.0,47.639123,-122.365666,4,1.0,1.0,...,96.000000,10.000000,10.000000,10.000000,10.000000,10.000000,10.000000,,6,1.480000
2,3308979,20160104002432,16708587,2.0,2.0,47.629724,-122.369483,11,4.5,5.0,...,97.000000,10.000000,10.000000,10.000000,10.000000,10.000000,10.000000,,2,1.150000
3,7421966,20160104002432,9851441,1.0,1.0,47.638473,-122.369279,3,1.0,0.0,...,94.539262,9.636392,9.556398,9.786709,9.809599,9.608916,9.452245,,1,2.078919
4,278830,20160104002432,1452570,2.0,2.0,47.632918,-122.372471,6,2.0,3.0,...,92.000000,9.000000,9.000000,10.000000,10.000000,9.000000,9.000000,,1,0.890000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3813,8101950,20160104002432,31148752,354.0,354.0,47.664295,-122.359170,6,2.0,3.0,...,80.000000,8.000000,10.000000,4.000000,8.000000,10.000000,8.000000,,8,0.300000
3814,8902327,20160104002432,46566046,1.0,1.0,47.649552,-122.318309,4,1.0,1.0,...,100.000000,10.000000,10.000000,10.000000,10.000000,10.000000,10.000000,,1,2.000000
3815,10267360,20160104002432,52791370,1.0,1.0,47.508453,-122.240607,2,1.0,1.0,...,94.539262,9.636392,9.556398,9.786709,9.809599,9.608916,9.452245,,1,2.078919
3816,9604740,20160104002432,25522052,1.0,1.0,47.632335,-122.275530,2,1.0,0.0,...,94.539262,9.636392,9.556398,9.786709,9.809599,9.608916,9.452245,,1,2.078919
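
Median and mode fills can be sketched in the same style as the mean fill. The frame below is made up, and the helper names `fill_median` and `fill_mode` are hypothetical. The example also shows why 'license' is still NaN in the output above: a column with no observed values has no mean, median, or mode to fill with.

```python
import numpy as np
import pandas as pd

# Made-up frame: one partly missing column and one entirely missing column.
toy = pd.DataFrame({
    "bedrooms": [1.0, 1.0, 2.0, 6.0, np.nan],
    "license": [np.nan] * 5,
})

def fill_median(col):
    # The median is less sensitive to outliers than the mean.
    return col.fillna(col.median())

def fill_mode(col):
    # mode() returns a Series (there can be ties); take the first value if any.
    modes = col.mode()
    return col.fillna(modes.iloc[0]) if not modes.empty else col

by_median = toy.apply(fill_median)
by_mode = toy.apply(fill_mode)

print(by_median)
print(by_mode)
```

Here the observed bedrooms are 1, 1, 2, 6, so the median fill uses 1.5 and the mode fill uses 1.0, while the mean fill would use 2.5.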


Imputation for Categorical values

In [13]:
# fill missing values in categorical cols with the placeholder 'Missing_Data'
df_l[cat_vars_l] = df_l[cat_vars_l].fillna('Missing_Data')
df_l[cat_vars_l]

Unnamed: 0,listing_url,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,...,has_availability,calendar_last_scraped,first_review,last_review,requires_license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification
0,https://www.airbnb.com/rooms/241032,2016-01-04,Stylish Queen Anne Apartment,Missing_Data,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,Missing_Data,Missing_Data,Missing_Data,...,t,2016-01-04,2011-11-01,2016-01-02,f,WASHINGTON,f,moderate,f,f
1,https://www.airbnb.com/rooms/953595,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",What's up with the free pillows? Our home was...,"Convenient bus stops are just down the block, ...",...,t,2016-01-04,2013-08-19,2015-12-29,f,WASHINGTON,f,strict,t,t
2,https://www.airbnb.com/rooms/3308979,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,Our house is located just 5 short blocks to To...,A bus stop is just 2 blocks away. Easy bus a...,...,t,2016-01-04,2014-07-30,2015-09-03,f,WASHINGTON,f,strict,f,f
3,https://www.airbnb.com/rooms/7421966,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,Missing_Data,A charming apartment that sits atop Queen Anne...,none,Missing_Data,Missing_Data,Missing_Data,...,t,2016-01-04,Missing_Data,Missing_Data,f,WASHINGTON,f,flexible,f,f
4,https://www.airbnb.com/rooms/278830,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,Belltown,The nearest public transit bus (D Line) is 2 b...,...,t,2016-01-04,2012-07-10,2015-10-24,f,WASHINGTON,f,strict,f,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3813,https://www.airbnb.com/rooms/8101950,2016-01-04,3BR Mountain View House in Seattle,Our 3BR/2BA house boasts incredible views of t...,"Our 3BR/2BA house bright, stylish, and wheelch...",Our 3BR/2BA house boasts incredible views of t...,none,We're located near lots of family fun. Woodlan...,Missing_Data,Missing_Data,...,t,2016-01-04,2015-09-27,2015-09-27,f,WASHINGTON,f,strict,f,f
3814,https://www.airbnb.com/rooms/8902327,2016-01-04,Portage Bay View!-One Bedroom Apt,800 square foot 1 bedroom basement apartment w...,This space has a great view of Portage Bay wit...,800 square foot 1 bedroom basement apartment w...,none,The neighborhood is a quiet oasis that is clos...,This is a basement apartment in a newer reside...,Uber and Car2go are good options in Seattle. T...,...,t,2016-01-04,2015-12-18,2015-12-24,f,WASHINGTON,f,moderate,f,f
3815,https://www.airbnb.com/rooms/10267360,2016-01-04,Private apartment view of Lake WA,"Very comfortable lower unit. Quiet, charming m...",Missing_Data,"Very comfortable lower unit. Quiet, charming m...",none,Missing_Data,Missing_Data,Missing_Data,...,t,2016-01-04,Missing_Data,Missing_Data,f,WASHINGTON,f,moderate,f,f
3816,https://www.airbnb.com/rooms/9604740,2016-01-04,Amazing View with Modern Comfort!,Cozy studio condo in the heart on Madison Park...,Fully furnished unit to accommodate most needs...,Cozy studio condo in the heart on Madison Park...,none,Madison Park offers a peaceful slow pace upsca...,Missing_Data,Yes,...,t,2016-01-04,Missing_Data,Missing_Data,f,WASHINGTON,f,moderate,f,f
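
A placeholder keeps the fact that a value was missing visible. For low-cardinality columns, one alternative is to fill with the most frequent value instead. The sketch below compares both on a made-up frame (the values are invented, and which option fits is a judgment call per column).

```python
import pandas as pd

# Made-up categorical columns with gaps.
toy = pd.DataFrame({
    "cancellation_policy": ["strict", "moderate", None, "strict"],
    "notes": [None, "Belltown", None, None],
})

# Option 1: explicit placeholder, as in the cell above.
placeholder = toy.fillna("Missing_Data")

# Option 2: most frequent value per column (first mode if there are ties).
most_frequent = toy.apply(lambda col: col.fillna(col.mode().iloc[0]))

print(placeholder)
print(most_frequent)
```

For free-text columns such as 'notes', the most frequent value is rarely meaningful, so the placeholder is usually the safer choice there.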


Imputation using a model to predict missing values

In [15]:
# numeric_cols = df_l.select_dtypes(include=[np.number])
# non_numeric_cols = df_l.select_dtypes(include=[object])
# note: scalar min_value/max_value clip the imputed values of every feature to the same range
# imp = IterativeImputer(RandomForestRegressor(), initial_strategy='median', max_iter=10, random_state=0, min_value=1, max_value=9)
# imp_iter = imp.fit_transform(numeric_cols)

In [None]:
# # verify if the numeric cols have any null values
# imputed_data_numeric = pd.DataFrame(imp_iter, columns=numeric_cols.columns())
# imputed_data_numeric.isnull().sum().sort_values(ascending=False)
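
The commented-out cells above can be tried on a small synthetic frame before running them on the full listings data. This is a sketch under stated assumptions: the `toy` frame is randomly generated, the forest size and `max_iter` are kept small only so it runs quickly, and `min_value`/`max_value` are left out so no feature is clipped.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic, correlated numeric columns (not real listings data).
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 5, size=40).astype(float)
toy = pd.DataFrame({
    "bedrooms": bedrooms,
    "beds": bedrooms + rng.integers(0, 2, size=40),
    "accommodates": bedrooms * 2,
})

# Knock out a few values so there is something to impute.
toy.loc[[3, 10, 25], "beds"] = np.nan
toy.loc[[5, 18], "accommodates"] = np.nan

# Each column with gaps is predicted from the other columns, round-robin.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    initial_strategy="median",
    max_iter=5,
    random_state=0,
)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Verify that no null values remain.
print(imputed.isnull().sum().sort_values(ascending=False))
```

On the real data, a column with no observed values at all (such as 'license') may be dropped from the imputer's output by default, so the column list passed to `pd.DataFrame` would need to account for that.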