
# Introduction to Pandas

## What is Pandas?
*   Pandas is a Python library used for working with data sets.
*   It has functions for analyzing, cleaning, exploring, and manipulating data.
*   The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". Wes McKinney created the library in 2008.

### Pandas deals with the following two data structures -
*   Series
*   DataFrame

## What is a Series?
*   A Pandas Series is like a column in a table.
*   It is a one-dimensional array holding data of any type.

## What is a DataFrame?
*   A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

## Creating Pandas Series
A pandas Series can be created using the following constructor -

```
pandas.Series(data, index, dtype)
```

### Parameter & Description

**data**
*   data takes various forms like ndarray, list, constants

**index**

*   Index values must be hashable, and the index must have the same length as data. Labels do not have to be unique. Defaults to np.arange(n) if no index is passed.

**dtype**

*   dtype is for data type. If None, data type will be inferred
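
As a small sketch of how the three parameters combine, the example below passes all of them at once. The values and the labels `x`, `y`, `z` are made up for illustration.

```python
import pandas as pd

# data is a list, index supplies custom labels, dtype forces float storage
s = pd.Series([1, 2, 3], index=['x', 'y', 'z'], dtype='float64')
print(s)
```

Without the `dtype` argument, pandas would infer an integer type from this list.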


In [1]:
import pandas as pd
import numpy as np

### Create an Empty Series
*  The most basic Series is an empty one.

In [2]:
s = pd.Series(dtype=float)
print(s)

Series([], dtype: float64)


### Create a Series from ndarray
*  If data is an ndarray, any index passed must have the same length as the array. If no index is passed, the default index is range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array) - 1].

In [3]:
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

0    a
1    b
2    c
3    d
dtype: object


### Create a Series from dict
*   A dict can be passed as input. If no index is specified, the dictionary keys become the index, in the order they were inserted (recent pandas versions do not sort them). If an index is passed, the values in data corresponding to the labels in the index are pulled out, and labels with no matching key get NaN.

In [4]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [5]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print (s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


### Create a Series from Scalar
*   If data is a scalar value, an index must be provided. The value will be repeated to match the length of index

In [6]:
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)

0    5
1    5
2    5
3    5
dtype: int64


### Accessing Data from Series with Position
*   Data in the series can be accessed similar to that in an ndarray.


In [7]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first element by position (plain s[0] is deprecated when the index holds labels)
print (s.iloc[0])

1


In [8]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first three elements
print (s[:3])

a    1
b    2
c    3
dtype: int64


In [9]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the last three elements
print (s[-3:])

c    3
d    4
e    5
dtype: int64


### Retrieve Data Using Label (Index)
*   A Series is like a fixed-size dict in that you can get and set values by index label.

In [10]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print (s['a'])

1


In [11]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements
print (s[['a','c','d']])

a    1
c    3
d    4
dtype: int64
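
The cells above only read values. As a minimal sketch of the "set" half of the dict analogy, the example below overwrites a value by label and reads a missing label safely. The replacement value `100`, the label `'z'`, and the default `-1` are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# overwrite the value stored under an existing label
s['a'] = 100

# .get returns a default instead of raising KeyError for a missing label
missing = s.get('z', -1)

print(s['a'])
print(missing)
```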


## Creating DataFrame
* A pandas DataFrame can be created using the following constructor -

```
pandas.DataFrame( data, index, columns, dtype)
```
**data**

*   data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.

**index**

*   Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.

**columns**

*   Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.

**dtype**

*   Data type of each column.

### A pandas DataFrame can be created using various inputs like

*   Lists
*   dict
*   Series
*   Numpy ndarrays
*   Another DataFrame
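
Lists, dicts, and Series each get their own section in this notebook. For the remaining two inputs, here is a short sketch, assuming a small made-up array and the column names `x` and `y`.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: each row of the array becomes a row of the frame
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])

# Another DataFrame: copy=True makes the new frame independent of the original
df_copy = pd.DataFrame(df_from_array, copy=True)
df_copy.loc[0, 'x'] = 99

print(df_from_array)
print(df_copy)
```

Only `df_copy` shows the value 99; the frame it was built from is unchanged.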

### Create an Empty DataFrame


In [12]:
df = pd.DataFrame()
print (df)

Empty DataFrame
Columns: []
Index: []


### Create a DataFrame from Lists

In [13]:
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)

   0
0  1
1  2
2  3
3  4
4  5


In [14]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


### Create a DataFrame from Dict of ndarrays / Lists
*   All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

*   If no index is passed, then by default, index will be range(n), where n is the array length.

In [15]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print (df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [16]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print (df)

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42


### Create a DataFrame from List of Dicts
*   List of Dictionaries can be passed as input data to create a DataFrame. 
*   The dictionary keys are by default taken as column names.

In [17]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print (df)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [18]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print (df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [20]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Two column labels, both of which are keys in the dictionaries
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# Two column labels; 'b1' is not a key in the dictionaries, so that column is all NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print (df1)
print (df2)

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN


### Create a DataFrame from Dict of Series

*   Dictionary of Series can be passed to form a DataFrame. 
*   The resultant index is the union of all the series indexes passed.


In [21]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


### Column Selection

In [22]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df ['one'])

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


## Adding, Deleting, Modifying the rows/columns in a dataframe
### Column Addition

In [23]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Add a new column by assigning a Series; values are aligned on the index labels
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print (df)

   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN


In [24]:
# Adding a new column using the existing columns in DataFrame:
df['four']=df['one']+df['three']

print (df)

   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN
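
The section heading also mentions modifying, which the cells above do not show. A minimal sketch, assuming a small one-column frame invented for this example:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# overwrite one cell, addressed by row label and column label
df.loc['b', 'one'] = 20.0

# overwrite a whole column with values derived from itself
df['one'] = df['one'] * 2

print(df)
```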


### Column Deletion

In [25]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print (df)

# using the del statement
print ("Deleting the first column using DEL function:")
del df['one']
print (df)

# using the pop method, which also returns the removed column
print ("Deleting another column using POP function:")
df.pop('two')
print (df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
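
Both `del` and `pop` change the DataFrame in place. When the original should stay intact, `DataFrame.drop` is an alternative that returns a new frame. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# drop returns a new DataFrame; df itself keeps all three columns
trimmed = df.drop(columns=['one', 'two'])

print(list(trimmed.columns))
print(list(df.columns))
```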


### Row Selection, Addition, and Deletion
### Selection by Label
*  Rows can be selected by passing a row label to the `loc` indexer.

In [26]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

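# rebuild df from d; otherwise df still holds the single-column frame from the previous cell
df = pd.DataFrame(d)
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4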


In [27]:
df = pd.DataFrame(d)
print (df.loc['b'])
#The result is a series with labels as column names of the DataFrame. 
#And, the Name of the series is the label with which it is retrieved.

one    2.0
two    2.0
Name: b, dtype: float64


### Selection by integer location

In [28]:
df.iloc[2]

one    3.0
two    3.0
Name: c, dtype: float64

### Slice Rows
Multiple rows can be selected using the slice operator `:`.

In [29]:
df[2:4]

   one  two
c  3.0    3
d  NaN    4
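
Slicing can also be written explicitly with `iloc` (positions) or `loc` (labels). The two differ in how they treat the end of the slice, as this sketch with a made-up frame suggests:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# positions 2 and 3; the stop position is excluded
by_position = df.iloc[2:4]

# labels 'b' through 'c'; both endpoints are included
by_label = df.loc['b':'c']

print(by_position)
print(by_label)
```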


### Addition of Rows 
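*   Add new rows to a DataFrame with `pd.concat`. The older `DataFrame.append` method was deprecated in pandas 1.4 and removed in pandas 2.0.
*   The rows of the second DataFrame are placed after the rows of the first, and each row keeps its original index label.

In [30]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
df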


   a  b
0  1  2
1  3  4
0  5  6
1  7  8


In [31]:
# named data rather than dict, so the built-in dict type is not shadowed
data = {'Name':['Martha', 'Tim', 'Rob', 'Georgia'],
        'Maths':[87, 91, 97, 95],
        'Science':[83, 99, 84, 76]
       }

df1 = pd.DataFrame(data)
display(df1)

data = {'Name':['Amy', 'Maddy'],
        'Maths':[89, 90],
        'Science':[93, 81]
       }

df2 = pd.DataFrame(data)
display(df2)

# ignore_index=True renumbers the combined rows 0..5, so no reset_index call is needed
df3 = pd.concat([df1, df2], ignore_index = True)

display(df3)

      Name  Maths  Science
0   Martha     87       83
1      Tim     91       99
2      Rob     97       84
3  Georgia     95       76


    Name  Maths  Science
0    Amy     89       93
1  Maddy     90       81


      Name  Maths  Science
0   Martha     87       83
1      Tim     91       99
2      Rob     97       84
3  Georgia     95       76
4      Amy     89       93
5    Maddy     90       81


### Deletion of Rows


In [32]:
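df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2])

# Drop rows with label 0 (both rows carrying that label are removed)
df = df.drop(0)

df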


   a  b
1  3  4
1  7  8
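
Only two rows remain above because the label 0 appears twice in the combined frame, and `drop` removes every row with that label. If the intent is to remove just the first row, one option is to give the rows unique labels first. A sketch using the same data, with `combined` and `result` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True renumbers the rows 0..3, so every label is unique
combined = pd.concat([df, df2], ignore_index=True)

# now drop(0) removes only the first row
result = combined.drop(0)
print(result)
```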
