## Missing Data
There will be times when the data you have imported contains missing values, NaN or zero values. You can take care of that by following any or some of these steps.

In [4]:
#import pandas
import pandas as pd

## Reading a CSV

In [5]:
car_dataset_url = 'https://raw.githubusercontent.com/ankitind/sample_datasets/master/car_ad.csv'
car_ads = pd.read_csv(car_dataset_url, header=0)


Content

**Dataset contains 9576 rows and 10 variables with essential meanings:**

- car: manufacturer brand
- price: seller’s price in advertisement (in USD)
- body: car body type
- mileage: as mentioned in advertisement (in thousands of km)
- engV: rounded engine volume (in thousands of cubic cm)
- engType: type of fuel (“Other” in this case should be treated as NA)
- registration: whether car registered in Ukraine or not
- year: year of production
- model: specific model name
- drive: drive type
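
The note on engType says that "Other" should be treated as NA. A minimal sketch of one way to do that follows, assuming a small made-up frame (`toy`) in place of the real `car_ads` data; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_ads (values are made up for illustration).
toy = pd.DataFrame({
    "car": ["Ford", "Audi", "Nissan"],
    "engType": ["Gas", "Other", "Petrol"],
})

# Replace the placeholder string with a real missing value,
# so that isnull(), dropna() and friends can see it.
toy["engType"] = toy["engType"].replace("Other", np.nan)

# Number of rows now flagged as missing in engType.
n_missing = toy["engType"].isnull().sum()
print(n_missing)
```

The same effect can probably be had at load time by passing `na_values={'engType': ['Other']}` to `pd.read_csv`, which marks the placeholder as NaN while the file is read.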

---

## Let's understand the data first
### Methods
First analyse the data using the .head(), .tail() and .info() methods. However, you'll quickly find that the printed results don't always show everything you need when there are many columns.
### Properties
Therefore, you also need to look at the data through its properties: .shape, .columns and .dtypes.

In [5]:
car_ads.head()

,car,price,body,mileage,engV,engType,registration,year,model,drive
0,Ford,15500.0,crossover,68,2.5,Gas,yes,2010,Kuga,full
1,Mercedes-Benz,20500.0,sedan,173,1.8,Gas,yes,2011,E-Class,rear
2,Mercedes-Benz,35000.0,other,135,5.5,Petrol,yes,2008,CL 550,rear
3,Mercedes-Benz,17800.0,van,162,1.8,Diesel,yes,2012,B 180,front
4,Mercedes-Benz,33000.0,vagon,91,,Other,yes,2013,E-Class,


In [4]:
car_ads.tail()

,car,price,body,mileage,engV,engType,registration,year,model,drive
9571,Hyundai,14500.0,crossover,140,2.0,Gas,yes,2011,Tucson,front
9572,Volkswagen,2200.0,vagon,150,1.6,Petrol,yes,1986,Passat B2,front
9573,Mercedes-Benz,18500.0,crossover,180,3.5,Petrol,yes,2008,ML 350,full
9574,Lexus,16999.0,sedan,150,3.5,Gas,yes,2008,ES 350,front
9575,Audi,22500.0,other,71,3.6,Petrol,yes,2007,Q7,full


In [6]:
car_ads.shape

(9576, 10)

In [7]:
car_ads.columns

Index(['car', 'price', 'body', 'mileage', 'engV', 'engType', 'registration',
       'year', 'model', 'drive'],
      dtype='object')

In [15]:
car_ads.dtypes

car              object
price           float64
body             object
mileage           int64
engV            float64
engType          object
registration     object
year              int64
model            object
drive            object
dtype: object

In [26]:
# The .info() method provides important information about a DataFrame, 
# such as the number of rows, number of columns, number of non-missing values in each column, 
# and the data type stored in each column.
car_ads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9576 entries, 0 to 9575
Data columns (total 10 columns):
car             9576 non-null object
price           9576 non-null float64
body            9576 non-null object
mileage         9576 non-null int64
engV            9142 non-null float64
engType         9576 non-null object
registration    9576 non-null object
year            9576 non-null int64
model           9576 non-null object
drive           9065 non-null object
dtypes: float64(2), int64(2), object(6)
memory usage: 748.2+ KB


---

### Describe
Get summary statistics for the columns that hold numerical values.

In [10]:
car_ads.describe()

,price,mileage,engV,year
count,9576.0,9576.0,9142.0,9576.0
mean,15633.317316,138.862364,2.646344,2006.605994
std,24106.523436,98.629754,5.927699,7.067924
min,0.0,0.0,0.1,1953.0
25%,4999.0,70.0,1.6,2004.0
50%,9200.0,128.0,2.0,2008.0
75%,16700.0,194.0,2.5,2012.0
max,547800.0,999.0,99.99,2016.0


In [11]:
## Dropping rows with NaN values
car_ads.dropna().describe()

,price,mileage,engV,year
count,8739.0,8739.0,8739.0,8739.0
mean,15733.542261,140.095434,2.588607,2006.609681
std,24252.90481,97.892213,5.41667,6.968947
min,0.0,0.0,0.1,1959.0
25%,5000.0,71.0,1.6,2004.0
50%,9250.0,130.0,2.0,2008.0
75%,16800.0,195.5,2.5,2012.0
max,547800.0,999.0,99.99,2016.0


In [21]:
car_ads.dropna().describe(include=['object'])

,car,body,engType,registration,model,drive
count,8739,8739,8739,8739,8739,8739
unique,83,6,4,2,827,3
top,Volkswagen,sedan,Petrol,yes,E-Class,front
freq,860,3321,4065,8236,182,4973


---

### value_counts() for non-numerical columns
Set the dropna parameter to False so that missing values in a column are included in the frequency counts. By default (dropna=True) NaN is left out.

In [15]:
car_ads['car'].value_counts().head()

Volkswagen       936
Mercedes-Benz    921
BMW              694
Toyota           541
VAZ              489
Name: car, dtype: int64

In [16]:
car_ads['car'].value_counts(dropna=False).head()

Volkswagen       936
Mercedes-Benz    921
BMW              694
Toyota           541
VAZ              489
Name: car, dtype: int64

In [6]:
car_ads['body'].value_counts(dropna=True)

sedan        3646
crossover    2069
hatch        1252
van          1049
other         838
vagon         722
Name: body, dtype: int64

In [11]:
car_ads['engType'].value_counts(dropna=False)

Petrol    4379
Diesel    3013
Gas       1722
Other      462
Name: engType, dtype: int64

In [22]:
car_ads['registration'].value_counts(dropna=False)

yes    9015
no      561
Name: registration, dtype: int64

In [24]:
car_ads['model'].value_counts(dropna=False).head()

E-Class                   199
A6                        172
Vito ����.                171
Kangoo ����.              146
Camry                     134
Lanos                     127
X5                        119
Caddy ����.               118
Octavia A5                108
Accord                     90
Megane                     88
Aveo                       80
520                        80
Trafic ����.               77
Land Cruiser Prado         76
Fabia                      75
Touareg                    69
Tucson                     69
Range Rover                68
Accent                     68
Passat B6                  66
Lacetti                    64
6                          64
Focus                      62
Superb                     61
Cayenne                    61
Vivaro ����.               61
T5 (Transporter) ����.     61
320                        61
Polo                       60
                         ... 
LandMark                    1
Juke Nismo                  1
NX 300    

In [None]:
car_ads['drive'].value_counts(dropna=False)
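
The columns counted above contain no NaN, so `dropna=False` makes no visible difference there. The sketch below shows the effect on a small made-up Series (`drive`) that stands in for a column with gaps; the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'drive' column (values are made up for illustration).
drive = pd.Series(["front", "rear", np.nan, "front", np.nan, "full"])

without_nan = drive.value_counts()            # default dropna=True: NaN is ignored
with_nan = drive.value_counts(dropna=False)   # NaN gets its own row in the counts

print(without_nan.sum())  # 4 non-missing values counted
print(with_nan.sum())     # 6, every row is accounted for
```

Comparing the two totals is a quick way to see how many values in a column are missing.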

---

### Finding rows (observations) with null or NaN values

In [249]:
car_ads[pd.isnull(car_ads['drive'])].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 511 entries, 4 to 9566
Data columns (total 10 columns):
car             511 non-null object
price           511 non-null float64
body            511 non-null object
mileage         511 non-null int64
engV            403 non-null float64
engType         511 non-null object
registration    511 non-null object
year            511 non-null int64
model           511 non-null object
drive           0 non-null object
dtypes: float64(2), int64(2), object(6)
memory usage: 43.9+ KB


In [27]:
car_ads[pd.isnull(car_ads['drive'])]

,car,price,body,mileage,engV,engType,registration,year,model,drive
4,Mercedes-Benz,33000.00,vagon,91,,Other,yes,2013,E-Class,
37,Audi,2850.00,sedan,260,,Other,no,1999,A6,
44,BMW,39333.00,sedan,6,2.00,Petrol,yes,2016,520,
52,Mercedes-Benz,31500.00,sedan,123,2.20,Diesel,yes,2011,E-Class,
103,Volkswagen,10000.00,van,231,1.90,Diesel,yes,2005,T5 (Transporter) ����.,
109,Nissan,12400.00,hatch,26,,Other,yes,2011,Leaf,
119,Mercedes-Benz,29500.00,sedan,37,1.80,Petrol,yes,2012,E-Class,
137,Mercedes-Benz,93555.00,crossover,0,,Other,yes,2016,GLS 350,
154,Nissan,17700.00,crossover,40,1.60,Petrol,yes,2014,Qashqai,
163,Mercedes-Benz,17900.00,van,167,2.20,Diesel,yes,2012,Vito ����.,
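
The filter above is a boolean mask: `pd.isnull(...)` returns True or False per row, and indexing with it keeps the True rows. As a rough sketch on a made-up frame (`toy`, not the real data), masks can also be inverted with `notnull()` or combined with `&`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_ads (values are made up for illustration).
toy = pd.DataFrame({
    "car": ["Ford", "Audi", "BMW", "Nissan"],
    "engV": [2.5, np.nan, 2.0, np.nan],
    "drive": ["full", np.nan, np.nan, "front"],
})

missing_drive = toy[toy["drive"].isnull()]    # rows where drive is NaN
present_drive = toy[toy["drive"].notnull()]   # the complementary rows

# Rows that are missing both engV and drive.
both_missing = toy[toy["engV"].isnull() & toy["drive"].isnull()]

print(len(missing_drive), len(present_drive), len(both_missing))
```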


## DataFrame methods .all() and .any()
### Columns that have all non-zero values

In [266]:
car_ads.all()

car              True
price           False
body             True
mileage         False
engV             True
engType          True
registration     True
year             True
model            True
drive            True
dtype: bool
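
Note that engV and drive report True above even though they contain NaN: `.all()` skips missing values by default, so it only flags zeros (or other falsy values). A minimal sketch on a made-up frame (`toy`) shows this, along with one way to count the zeros explicitly:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric columns of car_ads (values are made up).
toy = pd.DataFrame({
    "price": [15500.0, 0.0, 9200.0],
    "mileage": [68, 0, 150],
    "engV": [2.5, np.nan, 1.6],
})

# False only where a column contains a zero; the NaN in engV is skipped.
all_nonzero = toy.all()
print(all_nonzero)

# Count zeros per column explicitly (NaN == 0 evaluates to False).
zero_counts = (toy == 0).sum()
print(zero_counts)
```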

### Columns that have any non-zero values

In [267]:
car_ads.any()

car             True
price           True
body            True
mileage         True
engV            True
engType         True
registration    True
year            True
model           True
drive           True
dtype: bool

### Which columns have NaN

In [269]:
car_ads.isnull().any()

car             False
price           False
body            False
mileage         False
engV             True
engType         False
registration    False
year            False
model           False
drive            True
dtype: bool
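
The check above answers yes or no per column. As a sketch on a made-up frame (`toy`, an assumption for illustration), the same `isnull()` mask can be summed to count missing values per column, or reduced along `axis=1` to flag rows instead:

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_ads (values are made up for illustration).
toy = pd.DataFrame({
    "car": ["Ford", "Audi", "BMW", "Nissan"],
    "engV": [2.5, np.nan, 2.0, np.nan],
    "drive": ["full", np.nan, np.nan, "front"],
})

# Number of missing values in each column (True counts as 1).
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One boolean per row: True if that row has a NaN in any column.
row_has_nan = toy.isnull().any(axis=1)
print(row_has_nan.sum())  # number of rows with at least one NaN
```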

### Which columns have no NaN

In [271]:
car_ads.notnull().all()

car              True
price            True
body             True
mileage          True
engV            False
engType          True
registration     True
year             True
model            True
drive           False
dtype: bool

### Drop rows with NaN using dropna()

In [289]:
car_ads_no_missing_dropped_any = car_ads.dropna(how='any')
car_ads_no_missing_dropped_all = car_ads.dropna(how='all')
car_ads_no_missing_dropped_any.info()
print("---")
car_ads_no_missing_dropped_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8739 entries, 0 to 9575
Data columns (total 10 columns):
car             8739 non-null object
price           8739 non-null float64
body            8739 non-null object
mileage         8739 non-null int64
engV            8739 non-null float64
engType         8739 non-null object
registration    8739 non-null object
year            8739 non-null int64
model           8739 non-null object
drive           8739 non-null object
dtypes: float64(2), int64(2), object(6)
memory usage: 751.0+ KB
---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9576 entries, 0 to 9575
Data columns (total 10 columns):
car             9576 non-null object
price           9576 non-null float64
body            9576 non-null object
mileage         9576 non-null int64
engV            9142 non-null float64
engType         9576 non-null object
registration    9576 non-null object
year            9576 non-null int64
model           9576 non-null object
drive          
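
In the output above, how='all' keeps all 9576 rows because no row is missing every column, while how='any' leaves 8739. The sketch below reproduces the difference on a made-up frame (`toy`) and also shows the `subset` argument, which restricts the check to chosen columns; the data is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): one complete row, one partial row, one empty row.
toy = pd.DataFrame({
    "car": ["Ford", "Audi", np.nan],
    "engV": [2.5, np.nan, np.nan],
    "drive": ["full", "front", np.nan],
})

dropped_any = toy.dropna(how="any")             # drop rows with at least one NaN
dropped_all = toy.dropna(how="all")             # drop rows where every value is NaN
dropped_subset = toy.dropna(subset=["drive"])   # only look at the drive column

print(len(dropped_any), len(dropped_all), len(dropped_subset))
```

None of these calls modifies `toy`; each returns a new DataFrame, which is why the notebook assigns the results to new names.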

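### Using thresh in DataFrames
The thresh= keyword argument of dropna() sets the minimum number of non-NaN values a row or column needs in order to be kept. With thresh=9500 and axis='columns', any column with fewer than 9500 non-missing values out of 9576 rows (that is, more than 76 missing) is dropped, which removes engV and drive here.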

In [318]:
print(car_ads.head())
print("---")
car_ads_no_missing_dropped_col = car_ads.dropna(thresh=9500, axis='columns')
print(car_ads_no_missing_dropped_col.head())

             car    price       body  mileage  engV engType registration  \
0           Ford  15500.0  crossover       68   2.5     Gas          yes   
1  Mercedes-Benz  20500.0      sedan      173   1.8     Gas          yes   
2  Mercedes-Benz  35000.0      other      135   5.5  Petrol          yes   
3  Mercedes-Benz  17800.0        van      162   1.8  Diesel          yes   
4  Mercedes-Benz  33000.0      vagon       91   NaN   Other          yes   

   year    model  drive  
0  2010     Kuga   full  
1  2011  E-Class   rear  
2  2008   CL 550   rear  
3  2012    B 180  front  
4  2013  E-Class    NaN  
---
             car    price       body  mileage engType registration  year  \
0           Ford  15500.0  crossover       68     Gas          yes  2010   
1  Mercedes-Benz  20500.0      sedan      173     Gas          yes  2011   
2  Mercedes-Benz  35000.0      other      135  Petrol          yes  2008   
3  Mercedes-Benz  17800.0        van      162  Diesel          yes  2012   
4  
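
Because thresh counts non-missing values, a rule phrased as "drop columns with more than N missing values" has to be converted first. A minimal sketch on a made-up frame (`toy`), where `max_missing` is a hypothetical name introduced only for this example:

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_ads (values are made up for illustration).
toy = pd.DataFrame({
    "car": ["Ford", "Audi", "BMW", "Nissan"],
    "engV": [2.5, np.nan, np.nan, np.nan],
    "drive": ["full", np.nan, "rear", "front"],
})

max_missing = 1                    # hypothetical: tolerate at most 1 NaN per column
thresh = len(toy) - max_missing    # so at least 3 non-NaN values are required

kept = toy.dropna(thresh=thresh, axis="columns")
print(list(kept.columns))  # engV has only 1 non-NaN value and is dropped
```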