In [None]:
## Import pandas
import pandas as pd

#### Checking for missing values in a DataFrame

In [None]:
data.isnull()

In [None]:
## Count the NaN values in each column
data.isnull().sum()
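The same check can be narrowed to a single column. The sketch below uses a small made-up frame (the names `toy`, `state` and `rate` are illustrative assumptions, not columns of the dataset used in this notebook) to show the mask, the count, and the rows where that column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "rate": [12.5, np.nan, 7.0, np.nan],
})

# Boolean mask for one column: True where the value is missing.
rate_missing = toy["rate"].isnull()

# Number of missing values in that column.
n_missing = int(rate_missing.sum())

# Rows of the frame where that column is missing.
rows_missing = toy[rate_missing]
print(n_missing)
print(rows_missing)
```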

#### .iloc indexer

`.iloc` selects rows and columns by integer position, so it can pick out a single cell, a row, a column, or a rectangular block of a DataFrame. It is an indexer used with square brackets, data.iloc[<row selection>, <column selection>], not a function called with parentheses.

In [None]:
data.iloc[0:10] ## retrieve the first 10 rows

In [None]:
## retrieve the first 5 rows and the columns at positions 5, 6 and 7 (zero-based; the end of an iloc slice is excluded)
data.iloc[0:5,5:8]

In [None]:
## retrieve first two columns of data frame with all rows
data.iloc[:,0:2]

***.iloc returns a pandas Series when a single row (or a single column) is selected with an integer, and a pandas DataFrame when rows or columns are selected with a slice or a list.***
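A quick way to see the return types is to check them on a small frame. This is a minimal sketch on a made-up `toy` DataFrame (an assumption for illustration, not the notebook's dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame holding the integers 0..11.
toy = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["A", "B", "C", "D"])

one_row = toy.iloc[0]              # single integer  -> Series
row_as_frame = toy.iloc[[0]]       # list of one int -> DataFrame with one row
some_rows = toy.iloc[0:2]          # slice           -> DataFrame
two_full_cols = toy.iloc[:, 0:2]   # all rows, column slice -> DataFrame
one_cell = toy.iloc[1, 2]          # two integers    -> scalar value

print(type(one_row).__name__, type(row_as_frame).__name__, one_cell)
```

Wrapping the position in a list (`toy.iloc[[0]]`) is a convenient way to keep a one-row result as a DataFrame.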

#### .loc indexer

The pandas loc indexer can be used with DataFrames in two ways:

    a) Selecting rows by index label

    b) Selecting rows with a boolean / conditional lookup

The loc indexer uses the same syntax as iloc: data.loc[<row selection>, <column selection>]. Unlike iloc, it works on labels, and a label slice such as 0:5 includes both endpoints.
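The two use cases can be sketched on a small made-up frame whose index labels are deliberately not 0, 1, 2, ... (the names `toy`, `category` and `value` are assumptions for illustration). It also shows that a label slice with `loc` includes its end label, while a position slice with `iloc` does not.

```python
import pandas as pd

# Hypothetical frame with non-default index labels.
toy = pd.DataFrame(
    {"category": ["X", "Y", "X", "Y"], "value": [5, 15, 25, 35]},
    index=[10, 20, 30, 40],
)

# a) Selection by label: the slice 10:30 includes both labels 10 and 30.
by_label = toy.loc[10:30, ["value"]]

# For comparison, a position slice excludes its end position.
by_position = toy.iloc[0:2]

# b) Selection by condition: a boolean Series picks the rows, a label picks the column.
by_condition = toy.loc[toy["value"] > 10, "category"]

print(by_label)
print(by_position)
print(by_condition)
```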

In [None]:
## retrieve the row whose index label is 0 (the first row when the default index is used)
data.loc[0]

In [None]:
## Retrieve the rows labelled 0 to 5 (6 rows, because .loc slices include the end label) and the 'Scheduled Castes - 2007-08', 'Scheduled Castes - 2004-05' columns from the dataset
data.loc[0:5,['Scheduled Castes - 2007-08','Scheduled Castes - 2004-05']]

In [None]:
## Retrieve the rows of the dataset whose "Category of States" is equal to "Union Territories".
data.loc[data["Category of States"] == "Union Territories"]

In [None]:
## Retrieve the rows of the dataset whose "Scheduled Castes - 2004-05" is greater than 10.
data.loc[data["Scheduled Castes - 2004-05"] > 10]

In [None]:
## select the rows with index labels 20 to 30 (both included), and return only the "Scheduled Castes - 2004-05","Scheduled Castes - 2007-08","Scheduled Tribes - 2004-05","Scheduled Tribes - 2007-08" columns
data.loc[20:30,["Scheduled Castes - 2004-05","Scheduled Castes - 2007-08","Scheduled Tribes - 2004-05","Scheduled Tribes - 2007-08"]]

#### Dropping rows or columns from a DataFrame

In [None]:
import numpy as np

In [None]:
df = pd.DataFrame(np.arange(12).reshape(3, 4),columns=['A', 'B', 'C', 'D'])
df

In [None]:
### Drop columns 'B' and 'C' using axis=1
df.drop(['B', 'C'], axis=1)

In [None]:
### Drop columns 'B' and 'C' using the columns keyword
df.drop(columns=['B', 'C'])

In [None]:
### Drop rows by index label (here the rows labelled 0 and 1)
df.drop([0, 1])

In [None]:
df

In [None]:
### Drop column 'B' in place: df itself is modified and the call returns None
df.drop(columns=['B'], inplace=True)

In [None]:
df

#### dropna() function for deleting NaN values

The dropna() function removes the rows (or columns) that contain NaN values from the DataFrame.

Some useful Parameters
------------------------

axis : {0 or 'index', 1 or 'columns'}, default 0
    
    * 0, or 'index' : Drop rows which contain missing values.
        
    * 1, or 'columns' : Drop columns which contain missing values.

how : {'any', 'all'}, default 'any'
    
    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
        
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    
    Keep only the rows (or columns) that have at least this many non-NA values.
    
inplace : bool, default False
    
    If True, do operation inplace and return None.
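The effect of these parameters is easiest to see on a small frame. The sketch below uses a made-up `toy` DataFrame (an assumption for illustration) in which each row has a different number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 has 2 values,
# row 2 is entirely NaN, row 3 has 3 values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
    "d": [1.0, 2.0, np.nan, 4.0],
})

any_rows = toy.dropna()                      # default how='any': keep only complete rows
all_rows = toy.dropna(how="all")             # drop only rows where every value is NaN
thresh_rows = toy.dropna(thresh=3)           # keep rows with at least 3 non-NA values
thresh_cols = toy.dropna(axis=1, thresh=3)   # keep columns with at least 3 non-NA values

print(list(any_rows.index), list(all_rows.index), list(thresh_rows.index))
print(list(thresh_cols.columns))
```

None of these calls changes `toy`, because `inplace` is left at its default of False; each call returns a new DataFrame.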

In [None]:
data.head(10)

In [None]:
## drop every row that contains at least one NaN value
data.dropna()

In [None]:
## drop every column that contains at least one NaN value
data.dropna(axis=1, how='any')

In [None]:
## keep only the columns that have at least 10 non-NaN values
data.dropna(axis=1, thresh=10)

In [None]:
## keep only the rows that have at least 10 non-NaN values
data.dropna(thresh=10)

#### Replacing NaN values with a specific value

In [None]:
data.fillna(value = 'Vikas')
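A single fill value is applied to every column, whatever its type. When different columns need different replacements, `fillna` also accepts a dictionary that maps column names to fill values. This is a minimal sketch on a made-up frame (the names `toy`, `name` and `score` are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
toy = pd.DataFrame({"name": ["x", None], "score": [1.5, np.nan]})

# Fill each column with its own replacement value.
filled = toy.fillna(value={"name": "unknown", "score": 0.0})
print(filled)
```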

#### Replacing NaN values with the mean (mean imputation)

In [None]:
def Impute_mean(data, feature, mean):
    ## add a new column "<feature>_mean" in which the NaN values of feature are replaced by mean
    data[feature + "_mean"] = data[feature].fillna(mean)


In [None]:
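y = data.copy()  ## assumption: work on a copy so that data keeps its NaN values for the cells below
for x in data.columns[2:]:
    ## columns from position 2 onward are assumed to be numeric; fill each with its own mean, rounded to 2 decimals
    y[x] = data[x].fillna(round(data[x].mean(), 2))
y.head()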

In [None]:
data.head()

In [None]:
mean = round(data["Scheduled Castes - 2004-05"].mean(),2)
mean

In [None]:
Impute_mean(data,"Scheduled Castes - 2004-05",mean)
data.head()

Median and mode imputation work the same way.

Simply replace the mean with the median or the mode of the respective column.
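As a minimal sketch of that substitution, the example below fills a numeric column with its median and a text column with its mode. The frame and its column names (`toy`, `score`, `grade`) are assumptions for illustration, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({
    "score": [1.0, 2.0, np.nan, 2.0, 10.0],
    "grade": ["a", "b", "b", None, "b"],
})

# Median imputation: the median ignores NaN and is less sensitive to the outlier 10.0.
median_value = toy["score"].median()
toy["score_median"] = toy["score"].fillna(median_value)

# Mode imputation: mode() returns a Series because ties are possible,
# so take its first entry.
mode_value = toy["grade"].mode().iloc[0]
toy["grade_mode"] = toy["grade"].fillna(mode_value)

print(toy)
```

Mode imputation is the usual choice for categorical columns, where a mean or median is not defined.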

### Group by Method

Any group by operation involves one or more of the following steps on the original object:

    1) splitting the object into groups

    2) applying a function to each group

    3) combining the results
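The three steps can be made visible on a small frame. This is a sketch on made-up data (the names `toy`, `category` and `value` are assumptions for illustration): iterating over the grouped object shows the split, and an aggregation performs the apply and combine steps together.

```python
import pandas as pd

# Hypothetical frame with two groups.
toy = pd.DataFrame({
    "category": ["X", "Y", "X", "Y", "X"],
    "value": [1, 2, 3, 4, 5],
})

grouped = toy.groupby("category")

# 1) Split: each iteration yields the group key and that group's sub-frame.
pieces = {name: group["value"].tolist() for name, group in grouped}

# 2) Apply and 3) combine: sum() runs on each group and the per-group
# results are assembled into one Series indexed by the group key.
totals = grouped["value"].sum()

# Several functions can be applied at once with agg().
summary = grouped["value"].agg(["count", "mean"])

print(pieces)
print(totals)
print(summary)
```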

In [None]:
## groupby alone only returns a GroupBy object; nothing is computed until a function is applied
data.groupby("Category of States")

In [None]:
## count the non-NaN values of every column within each group
data.groupby("Category of States").count()

In [None]:
data.groupby(["Category of States","Non Special Category States"]).count()

### Pandas Concatenation function

pd.concat concatenates pandas objects along a particular axis, with optional set logic (union or intersection of labels) along the other axes.

It can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
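The hierarchical indexing mentioned above is requested with the `keys` argument of `pd.concat`. The sketch below uses two small made-up frames (`df_a`, `df_b`) and made-up key names (`first`, `second`), all of which are assumptions for illustration.

```python
import pandas as pd

# Two hypothetical frames whose row labels overlap (both use 0 and 1).
df_a = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df_b = pd.DataFrame([["c", 3], ["d", 4]], columns=["letter", "number"])

# keys adds an outer index level that records which frame each row came from.
stacked = pd.concat([df_a, df_b], keys=["first", "second"])

# The outer level can then be used to pull one source frame back out.
second_only = stacked.loc["second"]

# A (key, original label) pair addresses a single row.
one_cell = stacked.loc[("first", 1), "letter"]

print(stacked)
print(one_cell)
```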

In [None]:
df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])

In [None]:
df1

In [None]:
df2 = pd.DataFrame([['c', 3], ['d', 4]],columns=['letter', 'number'])
df2

In [None]:
### Stack the rows of df2 below df1 (default axis=0); the original index labels are kept
pd.concat([df1, df2])

In [None]:
### Place df2 beside df1 (axis=1); rows are aligned on their index labels
pd.concat([df1, df2], axis=1)

In [None]:
### Combine ``DataFrame`` objects with overlapping columns and return everything. Columns outside the intersection will be filled with ``NaN`` values.
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],columns=['letter', 'number', 'animal'])
df3

In [None]:
pd.concat([df1, df3])

In [None]:
### Combine ``DataFrame`` objects with overlapping columns and return only those that are shared by passing ``inner`` to the ``join`` keyword argument.
pd.concat([df1, df3], join="inner")

In [None]:
### join="outer" is the default: keep the union of the columns and fill the gaps with NaN
pd.concat([df1, df3], join="outer")

In [None]:
### ignore_index=True discards the original row labels and renumbers the result 0, 1, 2, ...
pd.concat([df1, df3], join="outer", ignore_index=True)