# Data types and missing data workbook

## Introduction

This is the workbook component of the "Data types and missing data" section of the tutorial.

## Relevant Resources
- [Data Types and Missing Data Reference](https://www.kaggle.com/residentmario/data-types-and-missing-data-reference)

## Set Up
**Fork this notebook using the button towards the top of the screen.**

Run the following cell to load your data and some utility functions.

In [1]:
import pandas as pd
import seaborn as sns

import sys
sys.path.append('../../input/advanced-pandas-exercises/')
from data_types_missing_data import *

reviews = pd.read_csv("../../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)  # fully qualified option name; the short 'max_rows' can be ambiguous in newer pandas

In [24]:
reviews.head()

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | null_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia | True |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | False |
| 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm | False |
| 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian | False |
| 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks | False |


**Exercise 1**: What is the data type of the `points` column in the dataset?

In [5]:
print(type(reviews.points))
print()
print(reviews.points.dtype)

<class 'pandas.core.series.Series'>

int64


In [4]:
answer_q1()

reviews.points.dtype
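
The two `print` calls above answer different questions: `type(...)` reports the container class (a `Series`), while `.dtype` reports the type of the values stored inside it. As a minimal sketch, the made-up three-row frame below (a hypothetical stand-in for `reviews`, not the real data) shows both, plus `DataFrame.dtypes` for inspecting every column at once.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine reviews table (made-up values).
toy = pd.DataFrame({
    "points": [87, 90, 85],
    "price": [15.0, None, 32.0],
    "country": ["Italy", "US", None],
})

points_container = type(toy.points)  # the container class, i.e. pandas Series
points_dtype = toy.points.dtype      # the type of the values held in the Series
all_dtypes = toy.dtypes              # a Series with one dtype per column

print(points_container)
print(points_dtype)
print(all_dtypes)
```

Note that `price` ends up as a floating-point column here because it contains a missing value, which matches the `15.0`-style prices shown for the real dataset.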


**Exercise 2**: Create a `Series` from entries in the `price` column, but convert the entries to strings. Hint: strings are `str` in native Python.

In [15]:
reviews.price.astype(str)

0          nan
1         15.0
          ... 
129969    32.0
129970    21.0
Name: price, Length: 129971, dtype: object

In [14]:
answer_q2()

reviews.price.astype(str)
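
One detail in the output above is worth a second look: the first entry is the text `nan`, not a missing value. The sketch below uses a few made-up prices (an assumption, not the real column) to check how many entries still count as missing after the conversion. The result can differ between pandas versions, so the code prints it rather than assuming it.

```python
import pandas as pd

# Made-up prices; the first one is missing, as in the real dataset's first row.
prices = pd.Series([None, 15.0, 14.0], name="price")

as_text = prices.astype(str)

# How many entries were missing before and after converting to strings?
missing_before = int(prices.isna().sum())
missing_after = int(as_text.isna().sum())

print(as_text.tolist())
print("missing before:", missing_before)
print("missing after: ", missing_after)
```

If `missing after` is lower than `missing before`, the conversion has turned missing prices into ordinary strings, so any null checks should run before `astype(str)`.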


Here are a few exercises on missing data.

**Exercise 3**: Some wines do not list a price. How often does this occur? Generate a `Series` that, for each review in the dataset, states whether the wine reviewed has a null `price`.

In [23]:
reviews['null_price'] = reviews['price'].isna()

reviews.price.isna()

0          True
1         False
          ...  
129969    False
129970    False
Name: price, Length: 129971, dtype: bool

In [20]:
answer_q3()

reviews.price.isnull()
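
The cell above uses `isna()` while the reference answer uses `isnull()`; in pandas these are two names for the same check. To answer the "how often" part of the question as a proportion, one option is to take the mean of the boolean `Series`. A minimal sketch on made-up prices (not the real column):

```python
import pandas as pd

# Made-up prices with one missing entry out of four.
prices = pd.Series([None, 15.0, 14.0, 13.0], name="price")

is_null = prices.isna()

# isnull() is an alias of isna(), so both give the same result.
same_result = prices.isnull().equals(is_null)

# The mean of a boolean Series is the fraction of True values.
share_missing = float(is_null.mean())

print(same_result)
print(share_missing)
```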


**Exercise 4**: What are the most common wine-producing regions? Create a `Series` counting the number of times each value occurs in the `region_1` field. This field is often missing data, so replace missing values with `Unknown`. Sort in descending order. Your output should look something like this:

```
Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
```

In [43]:
# Replace missing regions with a placeholder, then count rows per region.
reviews['region_1_noNaN'] = reviews.region_1.fillna('Unknown')
reviews.groupby('region_1_noNaN').region_1_noNaN.count().sort_values(ascending=False)


region_1_noNaN
Unknown          21247
Napa Valley       4480
                 ...  
Canada               1
Massachusetts        1
Name: region_1_noNaN, Length: 1230, dtype: int64

In [44]:
answer_q4()

reviews.region_1.fillna("Unknown").value_counts()


In [47]:
reviews.region_1.fillna("Unknown").value_counts()

Unknown            21247
Napa Valley         4480
                   ...  
Maury Sec              1
Goulburn Valley        1
Name: region_1, Length: 1230, dtype: int64
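
The `groupby` version and the `value_counts` version agree on the counts; only the order of regions tied at the same count differs, which is why the tails of the two outputs show different names. As a sketch on a handful of made-up region values, the block below also shows `value_counts(dropna=False)`, an alternative that counts missing entries without renaming them.

```python
import pandas as pd

# Made-up regions; None marks a missing value.
regions = pd.Series(
    ["Napa Valley", None, "Etna", "Napa Valley", None, None],
    name="region_1",
)

# Approach used in the exercise: label the gaps, then count.
filled_counts = regions.fillna("Unknown").value_counts()

# Alternative: keep the gaps as missing values but include them in the tally.
raw_counts = regions.value_counts(dropna=False)

print(filled_counts)
print(raw_counts)
```

Filling with `"Unknown"` gives a readable label in the index, while `dropna=False` leaves the missing marker itself as the index entry.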

**Exercise 5**: A neat property of boolean data types, like the ones created by the `isnull()` method, is that `False` gets treated as 0 and `True` as 1 when performing math on the values. Thus, the `sum()` of a list of boolean values will return how many times `True` appears in that list.
Create a `pandas` `Series` showing how many times each of the columns in the dataset contains null values. Your result should look something like this:

```
country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64
```

Hint for the follow-up task that `answer_q5()` addresses below: write a map that extracts the vintage of each wine from its `title`. The vintages reviewed range from 2000 to 2017, no earlier or later. Use `fillna` to fill in the entries where no vintage is found.

In [49]:
reviews.isnull().sum()

country           63
description        0
                  ..
null_price         0
region_1_noNaN     0
Length: 15, dtype: int64
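
The output above lists 15 columns rather than the expected 13 because the earlier cells added the helper columns `null_price` and `region_1_noNaN` to `reviews`. The sketch below, on a made-up three-row frame (not the real data), shows the boolean-sum idea and one way to leave a helper column out of the tally.

```python
import pandas as pd

# Plain Python already treats True as 1 and False as 0 when summing.
python_total = sum([True, False, True])

# Made-up stand-in for the reviews table.
toy = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "price": [None, 15.0, None],
    "points": [87, 87, 87],
})

# A helper column like the one added in Exercise 3.
toy["null_price"] = toy["price"].isnull()

# Drop the helper before counting so only the original columns are reported.
null_counts = toy.drop(columns=["null_price"]).isnull().sum()

print(python_total)
print(null_counts)
```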

In [50]:
answer_q5()


def find_year(description):
    if "20" in description:
        idx = description.index("20")
        maybe_date = description[idx:idx + 4]
        if maybe_date.isdigit():
            return maybe_date
        else:
            return None
    else:
        return None
        
reviews.title.map(find_year).fillna("N/A").sort_values()



In [51]:
def find_year(description):
    if "20" in description:
        idx = description.index("20")
        maybe_date = description[idx:idx + 4]
        if maybe_date.isdigit():
            return maybe_date
        else:
            return None
    else:
        return None
        
reviews.title.map(find_year).fillna("N/A").sort_values()

86904     2000
111191    2000
          ... 
89561      N/A
13550      N/A
Name: title, Length: 129971, dtype: object
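
`find_year` only inspects the first occurrence of `"20"` in a title, so a title where some other "20" precedes the vintage comes back as missing. As an alternative sketch (not the workbook's reference answer), the block below uses `Series.str.extract` with a regular expression limited to the stated 2000 to 2017 range. The first two titles come from the table above; the last two are invented for illustration.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Rainstorm 2013 Pinot Gris (Willamette Valley)",
    "Hypothetical Cuvée 20 Barrels 2011 Red",  # invented: a stray "20" precedes the vintage
    "Hypothetical NV Brut",                    # invented: no vintage at all
], name="title")

# A standalone four-digit year between 2000 and 2017.
vintage_pattern = r"\b(20(?:0\d|1[0-7]))\b"

# expand=False returns a Series; titles without a match become missing values.
vintages = titles.str.extract(vintage_pattern, expand=False).fillna("N/A")

print(vintages.tolist())
```

On the third title, `find_year` would return `None` because the four characters starting at the first `"20"` are not all digits, whereas the pattern above skips ahead to `2011`.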