In [3]:
import pandas as pd
rv = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

In [4]:
rv.price.dtype

dtype('float64')

In [5]:
rv.dtype

AttributeError: 'DataFrame' object has no attribute 'dtype'

In [6]:
rv.dtypes

country        object
description    object
                ...  
variety        object
winery         object
Length: 13, dtype: object

A column of one type can be converted into another, wherever such a conversion makes sense, by using the `astype()` method. For example, we may transform the `points` column from its existing `int64` data type into a `float64` data type:

In [7]:
rv.points.dtype

dtype('int64')

In [8]:
rv.points.astype('float64')

0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

In [9]:
rv.points.dtype

dtype('int64')

In [10]:
rv.index.dtype

dtype('int64')
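
The cells above suggest that `astype()` returns a converted copy and leaves the original column alone, since `rv.points.dtype` is still `int64` afterwards. A minimal sketch of that behavior, using a small made-up DataFrame (`df` and its values are illustrative stand-ins, not the wine dataset):

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews data.
df = pd.DataFrame({"points": pd.Series([87, 87, 90], dtype="int64")})

converted = df.points.astype("float64")  # returns a new Series
print(df.points.dtype)   # int64: the original column is unchanged
print(converted.dtype)   # float64

# Assign the result back if the conversion should persist.
df["points"] = df.points.astype("float64")
print(df.points.dtype)   # float64
```

Assigning the result back to the column is one common way to keep the new dtype.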

**Missing data**

Entries missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.
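
As a small illustration of the point above, the sketch below builds two toy Series (names and values are made up for this example) and shows that a missing entry pushes an otherwise integer column to `float64`:

```python
import numpy as np
import pandas as pd

whole = pd.Series([1, 2, 3], dtype="int64")
with_gap = pd.Series([1, 2, None])  # None is stored as NaN

print(whole.dtype)      # int64
print(with_gap.dtype)   # float64
print(type(np.nan))     # <class 'float'>
```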

pandas provides some methods specific to missing data. To select NaN entries, you can use `pd.isnull()` (or its companion `pd.notnull()`). It is meant to be used like this:

In [11]:
rv[rv.country.isnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
913,,"Amber in color, this wine has aromas of peach ...",Asureti Valley,87,30.0,,,,Mike DeSimone,@worldwineguys,Gotsa Family Wines 2014 Asureti Valley Chinuri,Chinuri,Gotsa Family Wines
3131,,"Soft, fruity and juicy, this is a pleasant, si...",Partager,83,,,,,Roger Voss,@vossroger,Barton & Guestier NV Partager Red,Red Blend,Barton & Guestier
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129590,,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,,,,Mike DeSimone,@worldwineguys,Büyülübağ 2012 Shah Red,Red Blend,Büyülübağ
129900,,This wine offers a delightful bouquet of black...,,91,32.0,,,,Mike DeSimone,@worldwineguys,Psagot 2014 Merlot,Merlot,Psagot
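
For comparison, here is a minimal sketch of the same selection on a tiny made-up DataFrame (`df` and its rows are assumptions for illustration), using both the top-level functions and the equivalent Series methods:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the second row has no country.
df = pd.DataFrame({
    "country": ["Italy", np.nan, "France"],
    "points": [87, 83, 90],
})

# pd.isnull(series) and series.isnull() build the same boolean mask.
missing_country = df[pd.isnull(df.country)]
known_country = df[pd.notnull(df.country)]

print(missing_country)
print(known_country)
```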


In [16]:
rv.country.fillna('country not available yet')

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In [15]:
rv.country.fillna('country not available yet')[913]

'country not available yet'

In [17]:
rv.region_2.fillna("Unknownnnnnnnn")

0         Unknownnnnnnnn
1         Unknownnnnnnnn
               ...      
129969    Unknownnnnnnnn
129970    Unknownnnnnnnn
Name: region_2, Length: 129971, dtype: object
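
The `fillna` calls above return a new Series rather than editing `rv`. A short sketch on made-up data (the `region` Series is hypothetical) shows the filled copy next to the untouched original:

```python
import numpy as np
import pandas as pd

region = pd.Series(["Napa", np.nan, np.nan, "Sonoma"], name="region_2")

filled = region.fillna("Unknown")  # new Series with NaN replaced

print(filled.tolist())
print(region.isnull().sum())  # the original still has its missing entries
```

To keep the filled values, the result would need to be assigned back, for example to the same column.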

**Replace**

In [19]:
rv.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

0            @kerino
1         @vossroger
             ...    
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object
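
A minimal sketch of `replace` on a toy Series (the second handle mapping, `"@voss"`, is invented purely for illustration). It swaps whole values and can also take a dict to handle several replacements at once:

```python
import pandas as pd

handles = pd.Series(["@kerinokeefe", "@vossroger", "@kerinokeefe"])

# Single old value -> new value.
updated = handles.replace("@kerinokeefe", "@kerino")

# Several replacements at once via a dict (hypothetical second mapping).
updated_many = handles.replace({"@kerinokeefe": "@kerino", "@vossroger": "@voss"})

print(updated.tolist())
print(updated_many.tolist())
```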

**Exercise 4**: What are the most common wine-producing regions? Create a `Series` counting the number of times each value occurs in the `region_1` field. This field is often missing data, so replace missing values with `Unknown`. Sort in descending order.  Your output should look something like this:

```
Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
```

In [22]:
rv.region_1.value_counts()

Napa Valley             4480
Columbia Valley (WA)    4124
                        ... 
McDowell Valley            1
Henty                      1
Name: region_1, Length: 1229, dtype: int64

In [21]:
rv.region_1.fillna("Unknown").value_counts()

Unknown            21247
Napa Valley         4480
                   ...  
McDowell Valley        1
Henty                  1
Name: region_1, Length: 1230, dtype: int64
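
The two results above differ in length by one because `value_counts()` skips NaN by default, so the missing entries only show up once they are filled. A small sketch on made-up values (the `region` Series is hypothetical) illustrates this, along with the default descending sort:

```python
import numpy as np
import pandas as pd

region = pd.Series(
    ["Napa Valley", np.nan, "Napa Valley", np.nan, np.nan, "Etna"],
    name="region_1",
)

without_fill = region.value_counts()                  # NaN is dropped
with_fill = region.fillna("Unknown").value_counts()   # NaN counted as "Unknown"

print(without_fill)
print(with_fill)
```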

**Exercise 5**: A neat property of boolean data types, like the ones created by the `isnull()` method, is that `False` gets treated as 0 and `True` as 1 when performing math on the values. Thus, the `sum()` of a list of boolean values will return how many times `True` appears in that list.
Create a `pandas` `Series` showing how many times each of the columns in the dataset contains null values. Your result should look something like this:

```
country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64
```

Hint: write a map that will extract the vintage of each wine in the dataset. The vintages reviewed range from 2000 to 2017, no earlier or later. Use `fillna` to impute the missing values.

In [24]:
rv.country.isnull().sum()

63

In [25]:
rv.isnull().sum()

country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64
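
The same idea can be sketched on a tiny made-up DataFrame (column names borrowed from the dataset, values invented): `isnull()` yields booleans, and `sum()` counts the `True` values per column.

```python
import numpy as np
import pandas as pd

# Plain Python: True counts as 1, False as 0.
print(sum([True, False, True]))  # 2

# Hypothetical three-row sample with some gaps.
df = pd.DataFrame({
    "country": ["Italy", np.nan, "France"],
    "price": [np.nan, np.nan, 30.0],
    "winery": ["A", "B", "C"],
})

null_counts = df.isnull().sum()  # one count per column
print(null_counts)
```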
