# Pandas
* The name is derived from the econometrics term "panel data".
* A package for data manipulation and analysis.
* Adds two data structures to Python:
  * Pandas Series
  * Pandas DataFrame
* A Pandas Series is a one-dimensional, array-like object that can hold many data types, such as numbers or strings.
* Each element in a Series can be assigned an index label.
* Pandas Series are mutable, like NumPy arrays.
* Some of its features:
  * Allows the use of labels for rows and columns
  * Can calculate rolling statistics on time series data
  * Easy handling of NaN values
  * Can load data in different formats into DataFrames
  * Can join and merge different datasets
  * Integrates with NumPy and Matplotlib
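
Two of the features listed above, NaN handling and rolling statistics, can be sketched briefly. The dates and readings below are made-up illustration data, not part of these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with one missing value.
dates = pd.date_range('2024-01-01', periods=5, freq='D')
readings = pd.Series([20.0, 22.0, np.nan, 24.0, 26.0], index=dates)

# NaN handling: carry the last known value forward.
filled = readings.ffill()

# Rolling statistic: mean over a window of 2 consecutive days.
rolling_mean = filled.rolling(window=2).mean()

print(rolling_mean)
```

The first rolling value is NaN because a 2-day window is not yet full on the first day.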
  



## Pandas Series

In [None]:
import pandas as pd

groc = pd.Series(data = [30,50, 'Yes', 'No'], index = ['wheat', 'mangos', 'candy', 'medicine'])

print(groc)
print()
print('Groc has shape: ', groc.shape)
print('Groc has dimension: ', groc.ndim)
print('Groc total elements = ', groc.size)

print()
print('The data in groc is: ', groc.values)
print('The index of groc is: ', groc.index)




wheat        30
mangos       50
candy       Yes
medicine     No
dtype: object

Groc has shape:  (4,)
Groc has dimension:  1
Groc total elements =  4

The data in groc is:  [30 50 'Yes' 'No']
The index of groc is:  Index(['wheat', 'mangos', 'candy', 'medicine'], dtype='object')


### Checking whether an index label exists

In [None]:
print('Is mangos an index label in Groc : ', 'mangos' in groc)

y = 'cough-syrup' in groc
print('Is cough-syrup an index label in Groc', y)

Is mangos an index label in Groc :  True
Is cough-syrup an index label in Groc False


### Accessing elements in Pandas
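* Elements can be accessed using index labels or numerical indices inside square brackets [].
* Numerical indices work much as in NumPy: both positive and negative indices can be used to access elements.
* The '.loc' attribute explicitly selects by index label.
* The '.iloc' attribute explicitly selects by numerical (positional) index.
* Recent pandas versions deprecate positional access through plain square brackets on a label-indexed Series (such as groc[[0,1]]), so '.iloc' is the safer choice for positions.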
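
The difference between '.loc' and '.iloc' is easiest to see when the index labels are themselves integers. A minimal sketch, using a made-up Series whose labels deliberately differ from the positions:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions.
s = pd.Series(['a', 'b', 'c'], index=[2, 1, 0])

by_label = s.loc[0]       # element whose LABEL is 0
by_position = s.iloc[0]   # element at POSITION 0
last = s.iloc[-1]         # negative positions count from the end

print(by_label, by_position, last)
```

Here '.loc[0]' and '.iloc[0]' return different elements, which is why being explicit about labels versus positions matters.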




In [None]:
print("Value for 'Wheat' is : ", groc['wheat'])
print()
print("Value for 'Wheat' is : ", groc.loc['wheat'])
print()
print("Values for 'Wheat' and 'Mangos' are : ", groc[['wheat', 'mangos']])
print()
print("Values for 'Wheat' and 'Mangos' are : ", groc.loc[['wheat', 'mangos']])
print()
print("Values for 'Wheat' and 'Mangos' are : ", groc[[0,1]])
print()
print("Values for 'Wheat' and 'Mangos' are : ", groc.iloc[[0,1]])
print()
print("Values for 'Wheat' and 'Mangos' are : ", groc[[-4,-3]])

Value for 'Wheat' is :  30

Value for 'Wheat' is :  30

Values for 'Wheat' and 'Mangos' are :  wheat     30
mangos    50
dtype: object

Values for 'Wheat' and 'Mangos' are :  wheat     30
mangos    50
dtype: object

Values for 'Wheat' and 'Mangos' are :  wheat     30
mangos    50
dtype: object

Values for 'Wheat' and 'Mangos' are :  wheat     30
mangos    50
dtype: object

Values for 'Wheat' and 'Mangos' are :  wheat     30
mangos    50
dtype: object


### Modifying/Adding elements

In [None]:
print("Original groc list: \n", groc)
print()
groc['wheat'] = 27 # Modified

groc['grains'] = 110 # Adding the elements
print()
print('Modified groc list is : \n', groc)

Original groc list: 
 wheat        30
mangos       50
candy       Yes
medicine     No
dtype: object


Modified groc list is : 
 wheat        27
mangos       50
candy       Yes
medicine     No
grains      110
dtype: object


### Deleting elements in Pandas
* The .drop() method removes elements from a Series. By default it returns a new Series and leaves the original unchanged.
* Passing inplace = True to .drop() deletes the elements from the original Series instead.

In [None]:
print('Original Groc list is : \n', groc)
print()
new_groc = groc.drop(['mangos', 'candy'])
print('New groc list = \n', new_groc)
print()
print('Original groc list is : \n', groc)
print()
new_groc['rice'] = 45
print('New groc list is : \n', new_groc)
print()
print('Original groc list : \n', groc)
print()
new_groc.drop('rice', inplace=True)
print('Deleting elements from new groc list : \n', new_groc)

Original Groc list is : 
 wheat        27
mangos       50
candy       Yes
medicine     No
dtype: object

New groc list = 
 wheat       27
medicine    No
dtype: object

Original groc list is : 
 wheat        27
mangos       50
candy       Yes
medicine     No
dtype: object

New groc list is : 
 wheat       27
medicine    No
rice        45
dtype: object

Original groc list : 
 wheat        27
mangos       50
candy       Yes
medicine     No
dtype: object

Deleting elements from new groc list : 
 wheat       27
medicine    No
dtype: object


### Arithmetic operations

In [None]:
# Inventory counts per fruit
fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])




In [None]:
print('Original Inventory \n', fruits)
print()
print('Increasing inventory by 2 \n', fruits + 2)
print()
print('Increasing inventory of oranges from {} to {}'.format(fruits['oranges'], fruits['oranges'] + 5))

Original Inventory 
 apples     10
oranges     6
bananas     3
dtype: int64

Increasing inventory by 2 
 apples     12
oranges     8
bananas     5
dtype: int64

Increasing inventory of oranges from 6 to 11


## Pandas DataFrames

* Pandas DataFrames are two-dimensional data structures with labeled rows and columns that can hold many data types.
* Pandas DataFrames can be created manually or by loading data from a file.
* Row labels are available through the 'index' attribute.
* Column labels are available through the 'columns' attribute.

* Manually creating a Pandas DataFrame involves:
  * Creating a dictionary of Pandas Series, a dictionary of lists, or a list of Python dictionaries.
  * Passing it to the pd.DataFrame() function to create the DataFrame.
  * DataFrame_Name.index gives the index labels (row labels).
  * DataFrame_Name.columns gives the column labels.
  * DataFrame_Name.values gives the data as a NumPy array.
* A subset of the data can be selected by passing columns=[] or index=[] as additional arguments to pd.DataFrame().
* A row/column position that has no value is represented by NaN (Not a Number).



### DataFrame creation using Pandas Series

In [None]:
import pandas as pd

items = {'Sam' : pd.Series(data = [434, 123, 345], index = ['Jacket', 'Socks', 'Belt']),
          'John' : pd.Series(data = [300, 125, 455, 1200], index = ['Shirt', 'Socks', 'Jacket','Shoes'])}

cart = pd.DataFrame(items)

print(cart)


          Sam    John
Belt    345.0     NaN
Jacket  434.0   455.0
Shirt     NaN   300.0
Shoes     NaN  1200.0
Socks   123.0   125.0


In [None]:
print('Cart has shape :', cart.shape)
print()
print('Cart has dimension : ', cart.ndim)
print()
print('Cart has a total elements :', cart.size)
print()
print('The data in cart are :\n', cart.values)
print()
print('The row index in the cart is: ', cart.index)
print()
print('The column index in the cart is: ', cart.columns)
print()


Cart has shape : (5, 2)

Cart has dimension :  2

Cart has a total elements : 10

The data in cart are :
 [[ 345.   nan]
 [ 434.  455.]
 [  nan  300.]
 [  nan 1200.]
 [ 123.  125.]]

The row index in the cart is:  Index(['Belt', 'Jacket', 'Shirt', 'Shoes', 'Socks'], dtype='object')

The column index in the cart is:  Index(['Sam', 'John'], dtype='object')



#### Passing a subset of the data (partial information)
 

In [None]:
# Creating DataFrames using column subset
sam_cart = pd.DataFrame(items, columns = ['Sam'])

print(sam_cart)
print()

sam_cart1 = pd.DataFrame(items, columns = ['Sam'], index=['Belt','Jacket','Shoes','Shirt','Shirt'])
print(sam_cart1)

        Sam
Jacket  434
Socks   123
Belt    345

          Sam
Belt    345.0
Jacket  434.0
Shoes     NaN
Shirt     NaN
Shirt     NaN


#### DataFrame creation using a Dictionary of Lists
* The process is the same when creating DataFrames from a dictionary of lists (or arrays).
* *Please note ==> All the lists (or arrays) must be of the same length.*
* Pandas automatically uses numerical row indexes (starting at 0) if no labels are provided.
* Index labels can be provided through the index argument of pd.DataFrame().
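
To illustrate the same-length requirement, here is a small sketch of what happens when the lists differ in length. The dictionary `bad_data` is a made-up variant of the data used in this section.

```python
import pandas as pd

# Hypothetical data: 'Floats' is one element shorter than 'Integers'.
bad_data = {'Integers': [1, 2, 3], 'Floats': [4.5, 6.6]}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as err:
    # pandas refuses to build a DataFrame from lists of unequal length.
    raised = True
    print('Could not build the DataFrame:', err)
```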

In [None]:
data = {
    'Integers' : [1,2,3],
    'Floats' : [4.5,6.6,8.4]
}

print(data)
print()

df = pd.DataFrame(data)
print(df)
print()
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])
print(df)

{'Integers': [1, 2, 3], 'Floats': [4.5, 6.6, 8.4]}

   Integers  Floats
0         1     4.5
1         2     6.6
2         3     8.4

         Integers  Floats
label 1         1     4.5
label 2         2     6.6
label 3         3     8.4


#### DataFrame creation using a List of Python Dictionaries

In [None]:
import pandas as pd

items2 = [
          {'bike':20, 'pants':30, 'watches':35},
          {'watches' : 10, 'glass':50, 'bikes':15, 'pants':5}                    
]

store_items = pd.DataFrame(items2)

print(store_items)
print()
store_items = pd.DataFrame(items2, index = ['store 1','store 2'])
print(store_items)

   bike  pants  watches  glass  bikes
0  20.0     30       35    NaN    NaN
1   NaN      5       10   50.0   15.0

         bike  pants  watches  glass  bikes
store 1  20.0     30       35    NaN    NaN
store 2   NaN      5       10   50.0   15.0


### Accessing Elements in Pandas DataFrames
* Data elements can be accessed using dataframe[column][row].
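
As a complement to dataframe[column][row], the sketch below shows the same lookup with '.loc', which takes the row label first and the column label second. It rebuilds the store_items data from this section so that it runs on its own; the value 12 written at the end is an arbitrary illustration.

```python
import pandas as pd

items2 = [{'bike': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glass': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# Chained access: column first, then row.
chained = store_items['bike']['store 1']

# Single-step access with .loc: row first, then column.
direct = store_items.loc['store 1', 'bike']

# .loc is also a reliable way to assign a single value.
store_items.loc['store 2', 'bike'] = 12

print(chained, direct)
print(store_items)
```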

In [None]:
print(store_items)

print('How many bikes in each store : ', store_items[['bike']])
print()
print('How many bikes in store 1 : ', store_items['bike']['store 1'])
print()
print('How many bike and pants are in each store : \n', store_items[['bike','pants']])
print()
print('How many watches in store 2: \n', store_items['watches']['store 2'])


         bike  pants  watches  glass  bikes
store 1  20.0     30       35    NaN    NaN
store 2   NaN      5       10   50.0   15.0
How many bikes in each store :           bike
store 1  20.0
store 2   NaN

How many bikes in store 1 :  20.0

How many bike and pants are in each store : 
          bike  pants
store 1  20.0     30
store 2   NaN      5

How many watches in store 2: 
 10


### Adding elements to the DataFrame

In [None]:
store_items['shirts'] = [20,5]
print(store_items)

         bike  pants  watches  glass  bikes  shirts
store 1  20.0     30       35    NaN    NaN      20
store 2   NaN      5       10   50.0   15.0       5


In [None]:
new_items = [{'bikes':20, 'pants':30, 'watches':50}]

new_store = pd.DataFrame(new_items, index = ['srote 3'])

print(new_items)

[{'bikes': 20, 'pants': 30, 'watches': 50}]


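### Appending a new row to a DataFrame
* DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job.

In [None]:
print('Original store items :\n', store_items)
print()
print()
store_items = pd.concat([store_items, new_store])  # formerly store_items.append(new_store)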
print('Data append at the end')
print(store_items)


Original store items :
          bike  pants  watches  glass  bikes  shirts
store 1  20.0     30       35    NaN    NaN      20
store 2   NaN      5       10   50.0   15.0       5


Data append at the end
         bike  pants  watches  glass  bikes  shirts
store 1  20.0     30       35    NaN    NaN    20.0
store 2   NaN      5       10   50.0   15.0     5.0
srote 3   NaN     30       50    NaN   20.0     NaN


### Adding a new column at the end using existing data in the DataFrame

In [None]:
store_items['new watches'] = store_items['watches'][1:]
print(store_items)

         pants  watches  shoes  glass  bikes  shirts  new watches
store 1     30       35      8    NaN    NaN    20.0          NaN
store 2      5       10      5   50.0   15.0     5.0         10.0
srote 3     30       50      6    NaN   20.0     NaN         50.0


### Adding a new column anywhere in the DataFrame using insert()

In [None]:
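# Insert a 'shoes' column at column position 3 (0-based).
# insert() raises ValueError if a column named 'shoes' already exists,
# for example when this cell is run a second time.
store_items.insert(3, 'shoes', [8,5,6])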
print(store_items)

ValueError: ignored

### Deleting Elements in a DataFrame
* The .pop() method can only delete columns.
* The .drop() method can delete both rows and columns, selected with the axis keyword.
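
A compact sketch of the three deletion variants, using a small made-up DataFrame so that it runs independently of the cells in this section:

```python
import pandas as pd

# Hypothetical two-store inventory.
store_items = pd.DataFrame(
    {'pants': [30, 5], 'watches': [35, 10], 'shoes': [8, 5]},
    index=['store 1', 'store 2'])

# axis = 0 drops rows by label; the original DataFrame is unchanged.
without_row = store_items.drop(['store 1'], axis=0)

# axis = 1 drops columns by label; the original DataFrame is unchanged.
without_col = store_items.drop(['watches'], axis=1)

# pop() removes a column from the original and returns it as a Series.
popped = store_items.pop('shoes')

print(without_row)
print(without_col)
print(store_items)
```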

In [None]:
print(store_items)
print()
print(store_items.pop('watches'))
print()
print(store_items)

         pants  watches  shoes  glass  bikes  shirts  new watches
store 1     30       35      8    NaN    NaN    20.0          NaN
store 2      5       10      5   50.0   15.0     5.0         10.0
srote 3     30       50      6    NaN   20.0     NaN         50.0

store 1    35
store 2    10
srote 3    50
Name: watches, dtype: int64

         pants  shoes  glass  bikes  shirts  new watches
store 1     30      8    NaN    NaN    20.0          NaN
store 2      5      5   50.0   15.0     5.0         10.0
srote 3     30      6    NaN   20.0     NaN         50.0


In [None]:
print(store_items)

         bike  pants  watches  glass  bikes
store 1  20.0     30       35    NaN    NaN
store 2   NaN      5       10   50.0   15.0


In [None]:
store_items = store_items.drop(['store 1'], axis = 0)
print(store_items)

         bike  pants  watches  glass  bikes
store 2   NaN      5       10   50.0   15.0


### Renaming row or column labels
* dataframe_name.rename()

In [None]:
import pandas as pd

items2 = [
          {'bike':20, 'pants':30, 'watches':35},
          {'watches' : 10, 'glass':50, 'bikes':15, 'pants':5}                    
]

store_items = pd.DataFrame(items2)

print(store_items)
print()
store_items = pd.DataFrame(items2, index = ['store 1','store 2'])
print(store_items)

   bike  pants  watches  glass  bikes
0  20.0     30       35    NaN    NaN
1   NaN      5       10   50.0   15.0

         bike  pants  watches  glass  bikes
store 1  20.0     30       35    NaN    NaN
store 2   NaN      5       10   50.0   15.0


In [None]:
store_items = store_items.rename(columns={'bikes' : 'hats'})
print(store_items)


         bike  pants  watches  glass  hats
store 1  20.0     30       35    NaN   NaN
store 2   NaN      5       10   50.0  15.0


In [None]:
store_items = store_items.rename(index = {'store 2' : 'last store'})
print(store_items)

            bike  pants  watches  glass  hats
store 1     20.0     30       35    NaN   NaN
last store   NaN      5       10   50.0  15.0


### Using a column as the row index

In [None]:
store_items = store_items.set_index('hats')
print(store_items)


      bike  pants  watches  glass
hats                             
NaN   20.0     30       35    NaN
15.0   NaN      5       10   50.0


## NaN Values


In [None]:
items2 = [ {'bikes' : 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
         {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}         
]

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

print(store_items)

print()

Total_items = store_items.size
print('Total items in the store are ', Total_items)



         bikes  pants  watches  shirts  shoes  suits  glasses
store 1     20     30       35    15.0      8   45.0      NaN
store 2     15      5       10     2.0      5    7.0     50.0
store 3     20     30       35     NaN     10    NaN      4.0

Total items in the store are  21


In [None]:
x = store_items.isnull().sum().sum()
print(x)

3


In [None]:
y = store_items.count().sum()
print('Non NaN values are ', y)

print()
print('Total NaN counts are ', (Total_items - y))

Non NaN values are  18

Total NaN counts are  3


### Eliminating rows or columns that have NaN values
* DataFrame_name.dropna(axis): axis = 0 drops rows, axis = 1 drops columns.
* dropna() returns a new DataFrame; pass inplace = True to modify the original DataFrame instead.

In [None]:
import pandas as pd
items2 = [ {'bikes' : 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
         {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}         
]

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

print(store_items)

print()

Total_items = store_items.size
print('Total items in the store are ', Total_items)

         bikes  pants  watches  shirts  shoes  suits  glasses
store 1     20     30       35    15.0      8   45.0      NaN
store 2     15      5       10     2.0      5    7.0     50.0
store 3     20     30       35     NaN     10    NaN      4.0

Total items in the store are  21


In [None]:
store_items.dropna(axis = 0)

         bikes  pants  watches  shirts  shoes  suits  glasses
store 2     15      5       10     2.0      5    7.0     50.0


In [None]:
store_items.dropna(axis = 1)

         bikes  pants  watches  shoes
store 1     20     30       35      8
store 2     15      5       10      5
store 3     20     30       35     10


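### Replacing NaN values
* .fillna() method, to replace NaN with a given value
* Forward or backward filling, to copy the previous or next value along an axis
* .interpolate(method = 'linear', axis = 0), to estimate missing values from neighboring values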
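
Backward filling and interpolation are not demonstrated in the cells of this section, so here is a minimal sketch on a made-up DataFrame. It assumes linear interpolation down the rows (axis = 0).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in the middle and one at the start.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 5.0, 6.0]},
                  index=['r1', 'r2', 'r3'])

# Backward fill: each NaN takes the next valid value below it.
back = df.bfill(axis=0)

# Linear interpolation: a NaN between two values becomes their midpoint.
# A leading NaN has no earlier neighbor, so it stays NaN by default.
interp = df.interpolate(method='linear', axis=0)

print(back)
print(interp)
```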



In [None]:
store_items.fillna(0, inplace=True)
print(store_items)

         bikes  pants  watches  shirts  shoes  suits  glasses
store 1     20     30       35    15.0      8   45.0      0.0
store 2     15      5       10     2.0      5    7.0     50.0
store 3     20     30       35     0.0     10    0.0      4.0


In [None]:
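# Assumes store_items still contains NaN values (re-create it if the previous cell was run).
# .ffill() replaces fillna(method = 'ffill'), which recent pandas versions deprecate.
store_items.ffill(axis = 0, inplace = True)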
print(store_items)

         bikes  pants  watches  shirts  shoes  suits  glasses
store 1     20     30       35    15.0      8   45.0      NaN
store 2     15      5       10     2.0      5    7.0     50.0
store 3     20     30       35     2.0     10    7.0      4.0


### Book Rating Exercise


In [None]:
import pandas as pd
import numpy as np

# Since we will be working with ratings, we will set the precision of our 
# dataframes to one decimal place.
pd.set_option('display.precision', 1)

# Create a Pandas DataFrame that contains the ratings some users have given to a
# series of books. The ratings given are in the range from 1 to 5, with 5 being
# the best score. The names of the books, the authors, and the ratings of each user
# are given below:

books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', 'H. G. Wells', 'Lewis Carroll' ])

user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

# An np.nan value means that the user has not yet rated that book.
# Use the data above to create a Pandas DataFrame that has the following column
# labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3', 'User 4'. Let Pandas
# automatically assign numerical row indices to the DataFrame. 

# Create a dictionary with the data given above
dat = 

# Use the dictionary to create a Pandas DataFrame
book_ratings = 

# If you created the dictionary correctly you should have a Pandas DataFrame
# that has column labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3',
# 'User 4' and row indices 0 through 4.

# Now replace all the NaN values in your DataFrame with the average rating in
# each column. Replace the NaN values in place. HINT: you can use the fillna()
# function with the keyword inplace = True, to do this. Write your code below:
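
One possible solution to the exercise is sketched below. It is not the only valid answer; the helper name `column_means` is introduced here for illustration, and the sketch assumes that filling with each column's mean is computed only over the numeric (user rating) columns.

```python
import numpy as np
import pandas as pd

books = pd.Series(data=['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet',
                        'The Time Machine', 'Alice in Wonderland'])
authors = pd.Series(data=['Charles Dickens', 'John Steinbeck', 'William Shakespeare',
                          'H. G. Wells', 'Lewis Carroll'])

user_1 = pd.Series(data=[3.2, np.nan, 2.5])
user_2 = pd.Series(data=[5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data=[2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data=[4, 3.5, 4, 5, 4.2])

# Dictionary keys become the column labels.
dat = {'Author': authors, 'Book Title': books,
       'User 1': user_1, 'User 2': user_2,
       'User 3': user_3, 'User 4': user_4}

# Shorter Series are padded with NaN up to row index 4.
book_ratings = pd.DataFrame(dat)

# Mean rating per user column (text columns are skipped).
column_means = book_ratings.mean(numeric_only=True)

# Replace each NaN with the mean of its own column, in place.
book_ratings.fillna(column_means, inplace=True)

print(book_ratings)
```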


## Loading Data into a Pandas DataFrame
* CSV is a very common file format.
* The pd.read_csv() function loads a CSV file into a DataFrame.
* DataFrame_Name.head() shows the first 5 rows of data.
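
A minimal sketch of loading CSV data. To stay self-contained it reads from an in-memory string through io.StringIO rather than from a file on disk; with a real file, the file path would be passed to pd.read_csv() instead. The CSV content is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; normally this would live in a .csv file.
csv_text = """store,bikes,pants
store 1,20,30
store 2,15,5
store 3,20,30
store 4,12,8
store 5,7,14
store 6,9,21
"""

# index_col turns the 'store' column into the row index.
df = pd.read_csv(io.StringIO(csv_text), index_col='store')

# head() returns the first 5 rows by default.
first_rows = df.head()

print(first_rows)
```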

