# Introduction
Since its development began in 2008, the pandas library has become a powerhouse for data manipulation in Python. Its intuitive syntax and flexible data structures make it easy to learn and speed up data analysis. Together, NumPy and pandas have extended Python's general-purpose nature to machine learning problems, and the adoption of Python in machine learning has been phenomenal since then.
This notebook covers data manipulation with NumPy and pandas from scratch.


## Table of Contents
1. Some important points about Numpy and Pandas
2. Beginning with Numpy
3. Beginning with Pandas
4. Exploring a Machine Learning Data Set
5. Building a Random Forest Model

## Some important points about Numpy and Pandas
1. The data manipulation capabilities of pandas are built on top of NumPy, so NumPy is a dependency of pandas.
2. Pandas is best at handling tabular data sets comprising different variable types (integer, float, double, etc.). It covers everything from simple tasks such as loading data to feature engineering on time series data.
3. NumPy is best suited to basic numerical and statistical computations such as mean, median, and range. It also supports the creation of multi-dimensional arrays.
4. NumPy can also be used to integrate C/C++ and Fortran code.
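
The relationship between the two libraries can be seen in a few lines. This is a minimal sketch on a hypothetical toy data frame (the column names `a` and `b` are made up for illustration): a pandas column hands back its data as a NumPy array, and NumPy functions work on it directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

values = df["b"].to_numpy()   # the column's data as a NumPy array
print(type(values))           # a numpy.ndarray
print(np.mean(values))        # NumPy functions accept it directly
print(df.dtypes)              # each column keeps its own data type
```

Note that each column has its own data type, which is what lets a data frame hold mixed tabular data.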


### Beginning with numpy

In [1]:
import numpy as np    #importing the numpy module as np
np.__version__        #checking the version of numpy

'1.11.3'

In [2]:
L=list(range(10))  #creating a list with range 0-9
print(L)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [3]:
[str(c) for c in L]    #converting integers to strings. This is called a list comprehension.

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [4]:
[type(item) for item in L]  #finding out the data-type of each element in list

[int, int, int, int, int, int, int, int, int, int]

#### Let's create some arrays
NumPy arrays are homogeneous in nature. This means that the elements they contain can only be of one data type. All the elements have to be int, or all float, and so on.
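
A small sketch of what homogeneity means in practice: when the inputs have mixed types, NumPy converts them all to one common type. The variable names here are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers: integer dtype
mixed = np.array([1, 2.5, 3])     # ints are upcast to float
texty = np.array([1, "a"])        # everything becomes a string

print(ints.dtype.kind, mixed.dtype, texty.dtype.kind)
print(texty)                      # the integer 1 is now the string '1'
```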

In [5]:
#creating an array with all elements as 0
np.zeros(10, dtype='int')   #10 is the length of the array and dtype declares the data type as integer

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [6]:
#creating a 4 row x 3 column matrix with all elements as 1
np.ones((4,3), dtype='int')

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

In [7]:
#creating a matrix with predefined values 
np.full((3,5),215.4553)

array([[ 215.4553,  215.4553,  215.4553,  215.4553,  215.4553],
       [ 215.4553,  215.4553,  215.4553,  215.4553,  215.4553],
       [ 215.4553,  215.4553,  215.4553,  215.4553,  215.4553]])

In [8]:
#creating an array with a set sequence
np.arange(0,20,2)     
#argument explanation: the array starts at 0, stops before 20 (20 itself is excluded) and the difference between consecutive elements is 2

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [9]:
#create an array of evenly spaced values within a given range
np.linspace(0,1,5)
#argument explanation: the array starts at 0, ends at 1 and contains 5 equally spaced elements, both ends included

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
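
The two functions treat the upper bound differently, which is easy to trip over. A short sketch of the difference, using the `endpoint` parameter of `np.linspace`:

```python
import numpy as np

a = np.arange(0, 1, 0.25)                  # stop value 1 is excluded
b = np.linspace(0, 1, 5)                   # stop value 1 is included
c = np.linspace(0, 1, 4, endpoint=False)   # same values as a

print(a)
print(b)
print(c)
```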

In [10]:
#creating a 3x3 array of random values drawn from a normal (Gaussian) distribution with mean=0 and standard deviation=1
np.random.normal(0,1,(3,3))

array([[ 0.55829552, -0.08415752,  1.39399486],
       [-0.86130653,  2.0297083 , -1.03483405],
       [ 0.60388751, -0.2277207 , -0.1939952 ]])

In [11]:
#creating an identity matrix
np.eye(3) #3 is to define that the matrix will be 3x3

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [12]:
#setting a random seed
np.random.seed(0)

x1 = np.random.randint(10, size=6) #one dimension
x2 = np.random.randint(10, size=(3,4)) #two dimension
x3 = np.random.randint(10, size=(3,4,5)) #three dimension

print("x1 ndim:", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)
print("x2 ndim:", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print("x3 ndim:", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x1 ndim: 1
x1 shape: (6,)
x1 size:  6
x2 ndim: 2
x2 shape: (3, 4)
x2 size:  12
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size:  60


#### Array indexing


In [13]:
#creating a numpy array is simple
x1=np.array([74,83,45,22,12])
x1

array([74, 83, 45, 22, 12])

In [14]:
#we can access any element of the numpy array just as we do in python
x1[1]

83

In [15]:
x1[-1]  #gets the last element. Similarly, -2 gets the second from last and so on

12

In [16]:
#creating a multidimensional array is also easy
x2=np.array([[1,2,3],[4,5,6],[7,8,9]])
x2

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [17]:
print(x2[2,0],x2[0,2])   #accessing the elements of a matrix by [row, column]

7 3


In [18]:
#3rd row, last column
x2[2,-1]

9

#### Array Slicing
Now we will try accessing multiple elements, or a range of elements, from an array.


In [19]:
x=np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [20]:
#from the start up to index 4 (index 5 is excluded)
x[:5]

array([0, 1, 2, 3, 4])

In [21]:
#from index 4 to the end
x[4:]

array([4, 5, 6, 7, 8, 9])

In [22]:
#from index 4 to index 6
x[4:7]

array([4, 5, 6])

In [23]:
#return elements at even indices
x[::2]

array([0, 2, 4, 6, 8])

In [24]:
#return elements starting from index 1, stepping by two
x[1::2]

array([1, 3, 5, 7, 9])

In [25]:
#reversing the array
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
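
One point worth keeping in mind when slicing: a basic slice of a NumPy array is a view of the original data, not a copy, so writing to the slice changes the original array. A minimal sketch (the names `view` and `copy` are illustrative):

```python
import numpy as np

x = np.arange(10)

view = x[2:5]          # a view: shares memory with x
view[0] = 99           # this also changes x[2]

copy = x[2:5].copy()   # an independent copy
copy[1] = -1           # x[3] is left untouched

print(x)
```

Use `.copy()` whenever the slice has to be modified without touching the source array.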

#### Array Concatenation
Combining arrays to make tasks easier and avoid making new arrays

In [26]:
#we can concatenate more than 2 arrays at once
x=np.array([1,2,3])
y=np.array([4,5,6])
z=np.array([7,8,9])
np.concatenate([x,y,z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [27]:
#a cooler way to produce multi-dimensional array
multiArray=np.array([[23,24,25],[12,13,14]])
np.concatenate([multiArray,multiArray])

array([[23, 24, 25],
       [12, 13, 14],
       [23, 24, 25],
       [12, 13, 14]])

In [28]:
#using concatenate's axis parameter, we can define row-wise or column-wise matrix
np.concatenate([multiArray,multiArray],axis=1)

array([[23, 24, 25, 23, 24, 25],
       [12, 13, 14, 12, 13, 14]])

np.concatenate() works well for concatenating arrays with the same number of dimensions.
But what if we have to combine a 2D array and a 1D array?
This is where hstack() and vstack() come into play.


In [29]:
x=np.array([3,4,5])
grid=np.array([[1,2,3],[17,18,19]])
np.vstack([x,grid])

array([[ 3,  4,  5],
       [ 1,  2,  3],
       [17, 18, 19]])

In [30]:
#similarly, make a horizontal stack using hstack
y=np.array([[9],[9]])
np.hstack([grid,y])

array([[ 1,  2,  3,  9],
       [17, 18, 19,  9]])
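
To see why the stacking helpers are needed, the sketch below reuses the `x` and `grid` arrays from above. Passing a 1D and a 2D array straight to `np.concatenate` raises a `ValueError`, while `np.vstack` handles the mismatch. Adding the missing axis by hand gives the same result.

```python
import numpy as np

x = np.array([3, 4, 5])
grid = np.array([[1, 2, 3], [17, 18, 19]])

try:
    np.concatenate([x, grid])   # 1D and 2D do not mix
    failed = False
except ValueError:
    failed = True
print("concatenate failed:", failed)

stacked = np.vstack([x, grid])                    # x is treated as one row
same = np.concatenate([x[np.newaxis, :], grid])   # manual equivalent
print(np.array_equal(stacked, same))
```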

#### Now let's see how can we split arrays based on pre-defined positions

In [31]:
x=np.arange(10)
x1,x2,x3=np.split(x,[3,6])
print(x1,x2,x3)

[0 1 2] [3 4 5] [6 7 8 9]


In [32]:
grid = np.arange(16).reshape((4,4))
upper,lower = np.vsplit(grid,[2])
print (upper, lower)

[[0 1 2 3]
 [4 5 6 7]] [[ 8  9 10 11]
 [12 13 14 15]]
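
Two related variations, sketched under the same setup: `np.hsplit` splits along columns instead of rows, and passing an integer instead of a list of positions asks `np.split` for that many equal parts.

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

left, right = np.hsplit(grid, [2])   # columns 0-1 and columns 2-3
print(left)
print(right)

thirds = np.split(np.arange(9), 3)   # three equal-sized pieces
print(thirds)
```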


### Beginning with Pandas

In [33]:
import pandas as pd

In [34]:
#creating a data frame. A dictionary defines its structure: keys act as column names and values as the row values of each column
data=pd.DataFrame({'Country':['Russia','Colombia','Chile','Ecuador','Nigeria'],'Rank':[121,40,100,130,11]})

In [35]:
#viewing quick summary statistics of the data set we are using 
data.describe()

Unnamed: 0,Rank
count,5.0
mean,80.4
std,52.300096
min,11.0
25%,40.0
50%,100.0
75%,121.0
max,130.0


In [36]:
#getting complete information about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
Country    5 non-null object
Rank       5 non-null int64
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes


In [37]:
#creating another dataframe 
data = pd.DataFrame({'group':['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,group,ounces
0,a,4.0
1,a,3.0
2,a,12.0
3,b,6.0
4,b,7.5
5,b,8.0
6,c,3.0
7,c,5.0
8,c,6.0


In [38]:
#sorting the data frame by ounces 
data.sort_values(by=['ounces'],ascending=True,inplace=False)    #inplace = True will make changes to data

Unnamed: 0,group,ounces
1,a,3.0
6,c,3.0
0,a,4.0
7,c,5.0
3,b,6.0
8,c,6.0
4,b,7.5
5,b,8.0
2,a,12.0


In [39]:
#sorting the data by multiple columns at once 
data.sort_values(by=['group','ounces'],ascending=[True,False],inplace=False)

Unnamed: 0,group,ounces
2,a,12.0
0,a,4.0
1,a,3.0
5,b,8.0
4,b,7.5
3,b,6.0
8,c,6.0
7,c,5.0
6,c,3.0


Sometimes we get data sets with duplicate rows, which are noise. This is why, before we train a model, we need to make sure we get rid of such inconsistencies. <b>We will now try to remove duplicate rows from a data set.</b>

In [40]:
#new data
data=pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,4,4]})
data

Unnamed: 0,k1,k2
0,one,3
1,one,2
2,one,1
3,two,3
4,two,3
5,two,4
6,two,4


In [41]:
#sorting the data
data.sort_values(by='k2')

Unnamed: 0,k1,k2
2,one,1
1,one,2
0,one,3
3,two,3
4,two,3
5,two,4
6,two,4


In [42]:
data.drop_duplicates()  #yes, removing duplicates is this simple.

Unnamed: 0,k1,k2
0,one,3
1,one,2
2,one,1
3,two,3
5,two,4


The call above removes duplicates based on matching row values across all columns. We can also remove duplicates based on a particular column, <b>k1</b> in this case.

In [43]:
data.drop_duplicates(subset='k1')

Unnamed: 0,k1,k2
0,one,3
3,two,3


Let's categorize rows based on predefined criteria. This comes up often during data processing, when we need to categorize a variable.

In [44]:
#defining new data
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Now we want to create a new variable that indicates the type of animal each food comes from. First, we create a dictionary that maps each food to its animal. Then we use the map function to look up each food value among the dictionary's keys and return the matching animal.

In [45]:
meat_to_animal={
    'bacon':'pig',
    'pulled pork':'pig',
    'corned beef':'cow',
    'pastrami':'cow',
    'honey ham':'pig',
    'nova lox':'salmon'
}
def meat_to_animal_func(series):
    if series['food'] == 'bacon':
        return 'pig'
    elif series['food'] == 'pulled pork':
        return 'pig'
    elif series['food'] == 'pastrami':
        return 'cow'
    elif series['food'] == 'corned beef':
        return 'cow'
    elif series['food'] == 'honey ham':
        return 'pig'
    else:
        return 'salmon'
    
#creating the new variable 'animal' 
data['animal']=data['food'].map(str.lower).map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also do it another way: convert the food values to lower case and apply the function 'meat_to_animal_func'.

In [46]:
lower = lambda x: x.lower()
data['food']=data['food'].apply(lower)
data['animal2']=data.apply(meat_to_animal_func, axis='columns')
data

Unnamed: 0,food,ounces,animal,animal2
0,bacon,4.0,pig,pig
1,pulled pork,3.0,pig,pig
2,bacon,12.0,pig,pig
3,pastrami,6.0,cow,cow
4,corned beef,7.5,cow,cow
5,bacon,8.0,pig,pig
6,pastrami,3.0,cow,cow
7,honey ham,5.0,pig,pig
8,nova lox,6.0,salmon,salmon
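
The two approaches agree on this data, but they treat unknown foods differently. Mapping with a dictionary yields NaN for any value that is not a key, whereas the if/else function above falls through to its final branch and returns 'salmon'. A minimal sketch with a shortened dictionary and a hypothetical extra food, 'tofu', that is not in the original data:

```python
import pandas as pd

# 'tofu' is a made-up value with no entry in the dictionary
foods = pd.Series(['bacon', 'Bacon', 'nova lox', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'nova lox': 'salmon'}

mapped = foods.str.lower().map(meat_to_animal)
print(mapped)                      # the last entry is NaN

labelled = mapped.fillna('unknown')
print(labelled)
```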


Another way to create a new variable is by using the 'assign' function.

In [47]:
data.assign(new_variable=data['ounces']*10)   #this is pretty self-explanatory 

Unnamed: 0,food,ounces,animal,animal2,new_variable
0,bacon,4.0,pig,pig,40.0
1,pulled pork,3.0,pig,pig,30.0
2,bacon,12.0,pig,pig,120.0
3,pastrami,6.0,cow,cow,60.0
4,corned beef,7.5,cow,cow,75.0
5,bacon,8.0,pig,pig,80.0
6,pastrami,3.0,cow,cow,30.0
7,honey ham,5.0,pig,pig,50.0
8,nova lox,6.0,salmon,salmon,60.0
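
One detail about `assign`: it returns a new data frame and leaves the original untouched, so the result has to be stored to be kept. A short sketch on a cut-down frame (the column name `tenfold` is made up for illustration):

```python
import pandas as pd

data = pd.DataFrame({'ounces': [4, 3, 12]})

result = data.assign(tenfold=data['ounces'] * 10)
print('tenfold' in result.columns)   # the new frame has the column
print('tenfold' in data.columns)     # the original does not

# a callable is evaluated against the frame being assigned to
result2 = data.assign(tenfold=lambda d: d['ounces'] * 10)
print(result2)
```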


Now let us see how to remove a column from the data frame.

In [48]:
#removing the unnecessary 'animal2' variable 
data.drop('animal2', axis='columns',inplace=True)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


Data sets often contain missing values. A quick way to impute them is to fill them in with an arbitrary number. 
Replacement is not only for missing values. Data sets also often contain outliers that may have to be replaced. Let's try replacing some values.

In [53]:
data=pd.Series([1,2.2,-321.1,124,2.5,-6.6,-321.1,108,911])
data

0      1.0
1      2.2
2   -321.1
3    124.0
4      2.5
5     -6.6
6   -321.1
7    108.0
8    911.0
dtype: float64

In [55]:
data.replace(-321.1,np.nan,inplace=True)   #replacing -321.1 with NaN values
data

0      1.0
1      2.2
2      NaN
3    124.0
4      2.5
5     -6.6
6      NaN
7    108.0
8    911.0
dtype: float64

In [56]:
#replacing multiple values at once 
data.replace([124.0,-6.6],np.nan,inplace=True)
data

0      1.0
1      2.2
2      NaN
3      NaN
4      2.5
5      NaN
6      NaN
7    108.0
8    911.0
dtype: float64
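
Once values are marked as NaN, they can be imputed. The sketch below is one possible approach, not the only one: it fills the gaps with the median of the remaining values using `fillna()`. It starts from the series as it looked after the first replacement, and the name `filled` is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2.2, np.nan, 124, 2.5, -6.6, np.nan, 108, 911])

median = data.median()         # NaN values are skipped by default
filled = data.fillna(median)   # returns a new series

print(median)
print(filled)
```

The median is less sensitive to extreme values such as 911 than the mean, which is why it is used in this sketch.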

Let's rename the columns and the row labels (index).

In [59]:
data=pd.DataFrame(np.arange(12).reshape((3,4)), index=['Ohio','Colarado','New York'], columns =['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colarado,4,5,6,7
New York,8,9,10,11


In [63]:
#using the rename() function to rename the index and columns 
data.rename(index={'Ohio':"San Fransisco"}, columns={'one':'one_p','two':'two_p'},inplace=True)

In [64]:
data

Unnamed: 0,one_p,two_p,three,four
San Fransisco,0,1,2,3
Colarado,4,5,6,7
New York,8,9,10,11


In [66]:
#using string functions to rename column names
data.rename(index=str.upper,columns=str.title,inplace=True)
data

Unnamed: 0,One_P,Two_P,Three,Four
SAN FRANSISCO,0,1,2,3
COLARADO,4,5,6,7
NEW YORK,8,9,10,11


Let us now categorize (bin) continuous variables.

In [71]:
ages=[20,22,25,19,25,92,45,44,70,79,20,67,31]

Now, let's divide the ages into bins such as 18-25, 26-35 and so on.

In [72]:
bins=[18,25,35,60,100]   #means 18-25, 25-35, 35-60, 60-100
cats=pd.cut(ages,bins)
cats   #in the output, '(' means the edge is excluded from the bin and ']' means it is included

[(18, 25], (18, 25], (18, 25], (18, 25], (18, 25], ..., (60, 100], (60, 100], (18, 25], (60, 100], (25, 35]]
Length: 13
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [73]:
#to close the bins on the left instead (include the left edge, exclude the right edge), we can do:
pd.cut(ages,bins,right=False)

[[18, 25), [18, 25), [25, 35), [18, 25), [25, 35), ..., [60, 100), [60, 100), [18, 25), [60, 100), [25, 35)]
Length: 13
Categories (4, object): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
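
The effect of `right` is easiest to see on a value that sits exactly on a bin edge. In this sketch, age 25 falls in the first bin by default and in the second bin with `right=False`. It also shows that the lowest edge, 18, belongs to no bin by default unless `include_lowest=True` is passed.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]

right_closed = pd.cut([25], bins)              # 25 lands in (18, 25]
left_closed = pd.cut([25], bins, right=False)  # 25 lands in [25, 35)
print(right_closed.codes, left_closed.codes)

# codes of -1 mark values that fall outside every bin
print(pd.cut([18], bins).codes)
print(pd.cut([18], bins, include_lowest=True).codes)
```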

In [76]:
#pandas library intrinsically assigns an encoding to categorical variables.
cats.codes

array([0, 0, 0, 0, 0, 3, 2, 2, 3, 3, 0, 3, 1], dtype=int8)

In [78]:
#checking how many observations fall under each bin
pd.value_counts(cats)

(18, 25]     6
(60, 100]    4
(35, 60]     2
(25, 35]     1
dtype: int64

We can pass a unique name to each label of the bins

In [80]:
bin_names=['Youth','Young Adult','Middle Age', 'Senior']
new_cats=pd.cut(ages,bins,labels=bin_names)
pd.value_counts(new_cats)

Youth          6
Senior         4
Middle Age     2
Young Adult    1
dtype: int64

We can also calculate their cumulative sum

In [81]:
pd.value_counts(new_cats).cumsum()

Youth           6
Senior         10
Middle Age     12
Young Adult    13
dtype: int64

We will now learn a thing or two about grouping data and creating pivots in pandas. This is one of the most important data analysis methods. 

In [None]:
df=pd.DataFrame