# Data Types and Missing Values

In [67]:
import pandas as pd

wines_df = pd.read_csv('./wines.csv')
wines_df.columns = wines_df.columns.str.strip()
wines_df

,Unnamed: 0,country,price,points,comments
0,0,USA,55.0,100,comments 1
1,1,Canada,,95,comments 2
2,2,Brazil,33.0,83,comments 3
3,3,,73.0,33,comments 4
4,4,USA,37.0,32,comments 5


## Data Types
Notice that one country is missing and one price is NaN. We'll handle both below. First we'll drop the unnamed column, then check the data type of each remaining column and make sure it's correct.

In [68]:
wines_df.drop('Unnamed: 0', axis=1, inplace=True)
wines_df

,country,price,points,comments
0,USA,55.0,100,comments 1
1,Canada,,95,comments 2
2,Brazil,33.0,83,comments 3
3,,73.0,33,comments 4
4,USA,37.0,32,comments 5
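An `Unnamed: 0` column usually appears when a CSV was saved with its index included. As a minimal sketch, assuming that is the case here, the column can be avoided at load time rather than dropped afterwards. The CSV text and the name `sample_df` below are made up for illustration and are not the contents of `wines.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text that mimics a file saved with its index included.
csv_text = """,country,price,points
0,USA,55.0,100
1,Canada,,95
2,Brazil,33.0,83
"""

# index_col=0 tells read_csv to use the first column as the index
# instead of loading it as a regular column named 'Unnamed: 0'.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample_df.columns.tolist())  # ['country', 'price', 'points']
```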


In [69]:
wines_df.dtypes

country     object
price       object
points       int64
comments    object
dtype: object

The price column was read as object rather than a numeric type. We'll need to convert it to a float.

In [70]:
wines_df.price = wines_df.price.astype('float')

Now, let's check again.

In [71]:
wines_df.dtypes 

country      object
price       float64
points        int64
comments     object
dtype: object
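`astype('float')` raises an error if any value cannot be parsed as a number. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` converts what it can and turns the rest into NaN. The sample values below are invented to show the behaviour and are not taken from `wines.csv`.

```python
import pandas as pd

# Hypothetical price column read as strings, with one blank and one stray value.
raw_price = pd.Series(['55.0', ' ', '33.0', 'n/a', '37.0'])

# errors='coerce' turns anything that cannot be parsed into NaN
# instead of raising, so the result is always a numeric column.
clean_price = pd.to_numeric(raw_price, errors='coerce')
print(clean_price.dtype)         # float64
print(clean_price.isna().sum())  # 2
```

The trade-off is that coercion hides bad input, so it is worth counting the NaN values before and after the conversion.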

## Missing Values

Now we can handle missing values. We'll start with price, using the fillna() method to replace the missing value with the mean of the column.

In [72]:
wines_df

,country,price,points,comments
0,USA,55.0,100,comments 1
1,Canada,,95,comments 2
2,Brazil,33.0,83,comments 3
3,,73.0,33,comments 4
4,USA,37.0,32,comments 5


In [73]:
wines_df['price'] = wines_df['price'].fillna(wines_df['price'].mean())
wines_df

,country,price,points,comments
0,USA,55.0,100,comments 1
1,Canada,49.5,95,comments 2
2,Brazil,33.0,83,comments 3
3,,73.0,33,comments 4
4,USA,37.0,32,comments 5


Now we'll handle the missing country. We'll use the fillna() method to fill in the missing value with the placeholder 'Unknown'.

In [74]:
wines_df['country'] = wines_df['country'].fillna('Unknown')
wines_df

,country,price,points,comments
0,USA,55.0,100,comments 1
1,Canada,49.5,95,comments 2
2,Brazil,33.0,83,comments 3
3,,73.0,33,comments 4
4,USA,37.0,32,comments 5


It doesn't work! This is likely because the missing country isn't stored as NaN at all, but as a string that only looks empty. Let's check.

In [75]:
wines_df.country[pd.isnull(wines_df.country)]

Series([], Name: country, dtype: object)

We can see that none of the values come back as null! The blank country must be a string, so instead of looking for NaN we need to look for strings that contain only spaces.
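One way to confirm what a blank-looking cell holds is to print its `repr()`, which shows quotes and any whitespace inside them. This is a small sketch on a made-up Series that assumes the blank value is a single space; the real file may differ.

```python
import pandas as pd

country = pd.Series(['USA', 'Canada', 'Brazil', ' ', 'USA'], name='country')

# repr() makes invisible characters visible: ' ' shows up with its quotes.
print(country.map(repr).tolist())

# Rows whose value is empty once surrounding whitespace is stripped.
blank_mask = country.str.strip() == ''
print(country[blank_mask].index.tolist())  # [3]
```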

In [76]:
wines_df.country = wines_df.country.replace(r'^[ ]+$', value='Unknown', regex=True)
wines_df

,country,price,points,comments
0,USA,55.0,100,comments 1
1,Canada,49.5,95,comments 2
2,Brazil,33.0,83,comments 3
3,Unknown,73.0,33,comments 4
4,USA,37.0,32,comments 5
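The pattern `^[ ]+$` matches one or more spaces, so a truly empty string or a tab would slip through. A more general sketch, using an invented Series, strips whitespace first and converts anything left empty to NaN, after which `isna()` and `fillna()` behave as expected.

```python
import pandas as pd

country = pd.Series(['USA', 'Canada', 'Brazil', ' ', 'USA', ''], name='country')

# Strip surrounding whitespace, then mask values that are now empty.
# mask() replaces entries where the condition is True with NaN.
stripped = country.str.strip()
normalized = stripped.mask(stripped == '')

# Now isna() sees the gaps and fillna() can do its job.
print(normalized.isna().sum())  # 2
cleaned = normalized.fillna('Unknown')
print(cleaned.tolist())
# ['USA', 'Canada', 'Brazil', 'Unknown', 'USA', 'Unknown']
```

Converting blanks to NaN first keeps a single representation of "missing" in the column, so later checks do not need to remember the placeholder.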
