# Data Types & Missing Values
* Real-world data often contains many nulls and other missing values.
* We need to handle them before we can use the data for our analysis.
* The data type of a DataFrame column or a Series is known as its dtype.
* We can access it using the dtype property.

In [2]:
# import the data we are going to work with.
import pandas as pd
# a forward slash avoids the invalid '\w' escape sequence in the original path
reviews = pd.read_csv('winemag_data/winemag-data-130k-v2.csv')

In [3]:
# check the data type of the price column
reviews.price.dtype

dtype('float64')

In [4]:
# we can check the data types of all columns at once using dtypes on the DataFrame itself
reviews.dtypes

# note: string columns appear as object, and so do columns that hold a mix of types.

Unnamed: 0                 int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object
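
To make the note about object columns concrete, here is a minimal sketch on a small hand-made DataFrame. The `toy` frame and its values are hypothetical and not taken from the wine reviews file. The string column is left out of the checks because the dtype reported for pure-string columns may differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews file.
toy = pd.DataFrame({
    "points": [87, 90, 92],                # all integers
    "price": [15.0, 30.5, 22.0],           # all floats
    "country": ["Italy", "US", "France"],  # all strings (object in the output above)
    "mixed": [1, "two", 3.0],              # a mix of Python types
})

# Map each column name to the name of its dtype.
dtype_names = {col: str(dt) for col, dt in toy.dtypes.items()}
print(dtype_names)
```

The mixed column is stored as object because no single numeric dtype can hold all of its values.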

## Converting Data Types
* We can convert an entire column from one data type to another, as long as the conversion makes sense.
* For example, we can convert from int64 to float64, or vice versa.
* To do so, we use the astype() function.

In [5]:
reviews.price.astype('float16')

0          NaN
1         15.0
2         14.0
3         13.0
4         65.0
          ... 
129966    28.0
129967    75.0
129968    30.0
129969    32.0
129970    21.0
Name: price, Length: 129971, dtype: float16
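
What "makes sense" means can be sketched on a few hand-made values. The Series below are hypothetical and not taken from the wine reviews file. The sketch assumes a pandas version that supports the nullable "Int64" dtype, and it only illustrates some common cases rather than every conversion rule.

```python
import pandas as pd

# Hypothetical toy values, not taken from the wine reviews file.
points = pd.Series([87, 90, 92], name="points")
price = pd.Series([15.0, None, 65.0], name="price")

# int64 -> float64 keeps every value intact.
points_as_float = points.astype("float64")

# float16 saves memory but has little precision: above 2048 it cannot
# represent every whole number, so 2049.0 gets rounded.
rounded = pd.Series([2049.0]).astype("float16")

# float64 -> int64 fails when the column contains NaN.
try:
    price.astype("int64")
    conversion_failed = False
except ValueError:
    conversion_failed = True

# The nullable "Int64" dtype (capital I) can hold integers and missing values.
price_nullable = price.astype("Int64")

print(points_as_float.dtype, rounded.iloc[0], conversion_failed, price_nullable.dtype)
```

So a narrower float type such as float16 is worth checking against the range of the data before using it.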

In [6]:
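reviews.index.dtype  # the index also has its own data type
# let's try converting it to int32, since int64 is more than we need
reviews.index.astype('int32')
# note: the output below still shows int64, because pandas versions before 2.0 keep integer indexes as int64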

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            129961, 129962, 129963, 129964, 129965, 129966, 129967, 129968,
            129969, 129970],
           dtype='int64', length=129971)

## Dealing with the missing data
* Entries with missing values are given the value NaN.
* NaN stands for Not a Number.
* For technical reasons, NaN is always a floating-point value (float64 in pandas): it comes from the floating-point standard, where it marks undefined results such as 0/0. It is not a very large number; that is inf. This is why a numeric column with missing values gets a float dtype by default.
* pandas provides some methods specifically for missing values.
* To select NaN entries we can use pd.isnull(); its companion pd.notnull() selects the entries that are not missing.
* We pass the column whose missing values we want to find to the isnull() function.
* Then, to get all rows where that column is null, we use the result of isnull() as an index into the DataFrame.
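
The point that NaN is a float can be checked directly. This is a minimal sketch on hand-made values (hypothetical, not taken from the wine reviews file), showing how a single missing entry changes the dtype of an otherwise integer Series.

```python
import numpy as np
import pandas as pd

# NaN itself is a plain Python float.
nan_is_float = isinstance(np.nan, float)

# NaN is never equal to anything, not even to itself,
# which is why we need isnull() instead of == to find it.
nan_equals_itself = (np.nan == np.nan)

# Hypothetical toy values: the same numbers with and without a gap.
complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, None, 3])

print(nan_is_float, nan_equals_itself, complete.dtype, with_gap.dtype)
```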

In [7]:
# this is how we get all the rows that have no country stated.
reviews[pd.isnull(reviews.country)]

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
913,913,,"Amber in color, this wine has aromas of peach ...",Asureti Valley,87,30.0,,,,Mike DeSimone,@worldwineguys,Gotsa Family Wines 2014 Asureti Valley Chinuri,Chinuri,Gotsa Family Wines
3131,3131,,"Soft, fruity and juicy, this is a pleasant, si...",Partager,83,,,,,Roger Voss,@vossroger,Barton & Guestier NV Partager Red,Red Blend,Barton & Guestier
4243,4243,,"Violet-red in color, this semisweet wine has a...",Red Naturally Semi-Sweet,88,18.0,,,,Mike DeSimone,@worldwineguys,Kakhetia Traditional Winemaking 2012 Red Natur...,Ojaleshi,Kakhetia Traditional Winemaking
9509,9509,,This mouthwatering blend starts with a nose of...,Theopetra Malagouzia-Assyrtiko,92,28.0,,,,Susan Kostrzewa,@suskostrzewa,Tsililis 2015 Theopetra Malagouzia-Assyrtiko W...,White Blend,Tsililis
9750,9750,,This orange-style wine has a cloudy yellow-gol...,Orange Nikolaevo Vineyard,89,28.0,,,,Jeff Jenssen,@worldwineguys,Ross-idi 2015 Orange Nikolaevo Vineyard Chardo...,Chardonnay,Ross-idi
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124176,124176,,This Swiss red blend is composed of four varie...,Les Romaines,90,30.0,,,,Jeff Jenssen,@worldwineguys,Les Frères Dutruy 2014 Les Romaines Red,Red Blend,Les Frères Dutruy
129407,129407,,Dry spicy aromas of dusty plum and tomato add ...,Reserve,89,22.0,,,,Michael Schachner,@wineschach,El Capricho 2015 Reserve Cabernet Sauvignon,Cabernet Sauvignon,El Capricho
129408,129408,,El Capricho is one of Uruguay's more consisten...,Reserve,89,22.0,,,,Michael Schachner,@wineschach,El Capricho 2015 Reserve Tempranillo,Tempranillo,El Capricho
129590,129590,,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,,,,Mike DeSimone,@worldwineguys,Büyülübağ 2012 Shah Red,Red Blend,Büyülübağ
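
The same selection works in the other direction with pd.notnull(), and isnull() can also be used to count missing values. The sketch below uses a small hypothetical DataFrame named `toy`, not the wine reviews file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews file.
toy = pd.DataFrame({
    "country": ["Italy", None, "US", None],
    "price": [15.0, 30.0, None, 22.0],
})

# Rows where country is missing, and rows where it is present.
missing_country = toy[pd.isnull(toy.country)]
known_country = toy[pd.notnull(toy.country)]

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

print(missing_country)
print(known_country)
print(missing_per_column)
```

Counting the missing values per column first is a quick way to decide which columns need attention.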


## Replacing the missing values
* We can replace the missing values with any value we want.
* We can use the fillna() function to do so.
* We pass the replacement value to the fillna() function.
* Instead of a value, we can also pass a fill method through the method parameter (recent pandas versions deprecate this parameter in favor of the ffill() and bfill() methods).
* For example ffill, which fills each missing value with the previous valid value in the column.
* Or bfill, which fills each missing value with the next valid value in the column.
* We can also pass the axis along which to fill the missing values.
* axis = 0 -> fill down each column.
* axis = 1 -> fill across each row.
* We can also pass inplace=True to modify the DataFrame itself instead of returning a new one.
* We can also pass the limit parameter to cap how many missing values are filled.
* Note that limit_direction is a parameter of interpolate(), not of fillna().
* Usually, we pass the result of an aggregation function to fillna(), so that the missing values are filled with the mean, median, or mode of the column.
* Usually we use the mean or median for numerical data, and the mode for categorical data.

In [8]:
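reviews.country.fillna(reviews.country.mode()[0])  # note: the method is mode(), not mod; it returns a Series, so we take its first value
reviews.country.fillna('Unknowns')  # alternatively, fill with a fixed placeholder; this is the result shown below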

0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object
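
The filling strategies described in this section can be compared side by side on a small example. This is a sketch on a hypothetical DataFrame named `toy`, not the wine reviews file. It uses the ffill() and bfill() methods directly, which is assumed here to be the more portable spelling across pandas versions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews file.
toy = pd.DataFrame({
    "country": ["US", None, "US", "France", None],
    "price": [10.0, None, None, 40.0, 20.0],
})

# Categorical column: fill with a placeholder or with the most frequent value.
country_placeholder = toy.country.fillna("Unknown")
country_mode = toy.country.fillna(toy.country.mode()[0])

# Numerical column: fill with the mean or the median of the known values.
price_mean = toy.price.fillna(toy.price.mean())
price_median = toy.price.fillna(toy.price.median())

# Fill from neighbouring rows instead of a single value.
price_forward = toy.price.ffill()          # copy the previous valid value
price_backward = toy.price.bfill()         # copy the next valid value
price_limited = toy.price.ffill(limit=1)   # fill at most one value per gap

print(country_mode.tolist())
print(price_median.tolist())
print(price_forward.tolist())
print(price_backward.tolist())
print(price_limited.tolist())
```

None of these calls change `toy` itself; each returns a new Series, so the result has to be assigned back if we want to keep it.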