In this section of the course we will learn how to use pandas for data analysis.

* Introduction to pandas 

* Series 

* DataFrames 

* Missing Data 

* Group By 

* Merging, Joining, Concatenation 

* Operations in pandas 

* Data Input and Output


Series

The first main pandas data type we will learn about is the Series. Let's import pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [7]:
# first we need to import the libraries

import numpy as np 
import pandas as pd 






In [12]:
labels=['a','b','c','d']
my_list = [10,20,30,40]

arr = np.array([10,20,30,40])

d={'a':10,'b':20,'c':30,'d':40}

In [15]:

# Using the lists 


s=pd.Series(my_list)

In [16]:
s

0    10
1    20
2    30
3    40
dtype: int64

In [17]:
pd.Series(data=my_list)

0    10
1    20
2    30
3    40
dtype: int64

In [18]:
# use the labels as the index

pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
d    40
dtype: int64

In [19]:
pd.Series(my_list,labels)

a    10
b    20
c    30
d    40
dtype: int64

**NumPy Arrays**

In [20]:
pd.Series(arr)

0    10
1    20
2    30
3    40
dtype: int32

In [21]:
pd.Series(data=arr, index=labels)

a    10
b    20
c    30
d    40
dtype: int32

In [22]:
pd.Series(arr, labels)

a    10
b    20
c    30
d    40
dtype: int32

**Dictionary**

In [23]:
d

{'a': 10, 'b': 20, 'c': 30, 'd': 40}

In [26]:
pd.Series(d)

a    10
b    20
c    30
d    40
dtype: int64
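As a small sketch of how the dictionary case behaves: the keys become the index, and if we also pass `index=`, pandas picks out and orders the entries by those labels. The names `from_dict` and `subset` and the label `'z'` are made up for this illustration.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# Keys become the index, values become the data.
from_dict = pd.Series(d)

# Passing index= together with a dict selects and orders by those labels.
# 'z' is not a key in d, so its value is NaN and the dtype becomes float64.
subset = pd.Series(d, index=['b', 'a', 'z'])
print(subset)
```

A label with no matching key is not an error here; it simply shows up as missing data.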

### Data in a Series

A pandas Series can hold a variety of object types:

In [30]:
pd.Series(data=labels)

0    a
1    b
2    c
3    d
dtype: object

In [36]:
# Even functions (although unlikely that you will use this)

pd.Series([sum, print])

0      <built-in function sum>
1    <built-in function print>
dtype: object
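To see how the dtype follows the data, here is a minimal sketch with three made-up Series (`ints`, `floats`, `mixed`). The point is that pandas picks one dtype for the whole Series and falls back to the generic `object` dtype when the values do not share a numeric type.

```python
import pandas as pd

ints = pd.Series([10, 20, 30])            # all integers -> an integer dtype
floats = pd.Series([10, 20.5, 30])        # one float upcasts everything to float64
mixed = pd.Series([10, 'twenty', 30.0])   # numbers plus a string -> object

print(ints.dtype, floats.dtype, mixed.dtype)
```

Numeric dtypes are generally faster to work with than `object`, so it is worth checking `dtype` after building a Series from mixed data.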

## Using an Index

The key to using a Series is understanding its index. pandas uses these index names or numbers for fast lookups of information (the index works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. After a few warm-up examples, we will create two Series, Ser1 and Ser2:

In [37]:

import numpy as np
import pandas as pd

# general form:
# s = pd.Series(data, index=index)
anyname=pd.Series(list('abcdef'))
print(anyname)


0    a
1    b
2    c
3    d
4    e
5    f
dtype: object


In [38]:
np_countries=np.array(['lohit','susumu','Yuri','kota','Naoya'])

In [42]:
s_country=pd.Series(np_countries)
s_country

0     lohit
1    susumu
2      Yuri
3      kota
4     Naoya
dtype: object

In [44]:
Ser1=pd.Series([1,2,3,4], index=['USA','Germany','USSR','Japan'])
Ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [45]:
Ser2=pd.Series([1,2,3,4],index=['USA','Germany','Italy', 'Japan'])
Ser2

USA        1
Germany    2
Italy      3
Japan      4
dtype: int64

In [48]:
Ser1['USA'] 

1
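Since the index works much like dictionary keys, a few dictionary-style checks are available too. The sketch below rebuilds the same data under the made-up name `ser`; the label `'Italy'` is deliberately one that is not in the index.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

# Membership is tested against the index labels, not the values.
print('USA' in ser)

# .get returns a default instead of raising when the label is missing.
print(ser.get('Italy', 0))

# Plain bracket lookup of a missing label raises KeyError.
try:
    ser['Italy']
except KeyError as err:
    print('missing label:', err)
```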

Operations between Series are also aligned on the index. A label that appears in only one of the two Series gets NaN in the result:

In [49]:
Ser1+Ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
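The result above is float64 because NaN is a floating-point value, so the integer sums are upcast. If we would rather treat a missing label as 0 than get NaN, one option is the `add` method with `fill_value`. This is a sketch using lowercase copies of the two Series (`ser1`, `ser2`); `total` is a made-up name.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# fill_value=0 stands in for a label that is missing from one side,
# so 'Italy' and 'USSR' keep their own value instead of becoming NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

Whether 0 is the right stand-in depends on the data; for counts it usually is, for measurements it may hide a real gap.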

In [50]:

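# cal is labelled by name, cal2 is labelled 1..5: the two indexes share
# no labels, so every entry of the sum is NaN
cal=pd.Series([1,2,3,4,5], index=['Lohit','mohot','loadja','shia','sjaa'])
cal2=pd.Series(['Lohit','mohit','loadja','shia','sjaa'],index=[1,2,3,4,5])
cal+cal2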

1         NaN
2         NaN
3         NaN
4         NaN
5         NaN
Lohit     NaN
loadja    NaN
mohot     NaN
shia      NaN
sjaa      NaN
dtype: object

In [53]:
scalar_series=pd.Series(5, index=['a','b','c','d'])

In [54]:
# a scalar value is repeated for every index label, so a, b, c and d all get 5

scalar_series

a    5
b    5
c    5
d    5
dtype: int64

In [57]:

name=pd.Series([1,2,3,4,5],  index=['Lohit','mohot','loadja','shia','sjaa'])
name

Lohit     1
mohot     2
loadja    3
shia      4
sjaa      5
dtype: int64

In [58]:
# selecting
name['loadja']

3

In [59]:
# slicing by position: the first two entries (same as name.iloc[0:2])

name[0:2]

Lohit    1
mohot    2
dtype: int64

In [63]:
# .loc looks up a value by its index label
print('the value of the loadja is',name.loc['loadja'])

the value of the loadja is 3


In [64]:
name.loc['shia']

4

In [65]:
name.iloc[3]

4
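To summarise the two accessors: `.loc` works with labels and `.iloc` works with integer positions. One difference that is easy to miss shows up when slicing, sketched below with the same `name` Series (`by_position` and `by_label` are made-up names).

```python
import pandas as pd

name = pd.Series([1, 2, 3, 4, 5],
                 index=['Lohit', 'mohot', 'loadja', 'shia', 'sjaa'])

# Position slices exclude the end position, like Python lists.
by_position = name.iloc[0:2]            # 'Lohit', 'mohot'

# Label slices include the end label.
by_label = name.loc['Lohit':'loadja']   # 'Lohit', 'mohot', 'loadja'

print(len(by_position), len(by_label))
```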

In [67]:
first=pd.Series([1,2,3,4], index=['a','b','c','d'])
second=pd.Series([10,20,30,40], index=['a','b', 'c', 'd'])

print(first+second)

a    11
b    22
c    33
d    44
dtype: int64


In [None]:
first=pd.Series([10,20,30,40])
second=pd.Series([10,20,30,40], index=['a','b', 'c', 'd'])

print(first+second)
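No output was recorded for the cell above. As a sketch of what to expect under pandas' usual label alignment: `first` has the default labels 0 to 3 and `second` has 'a' to 'd', so no label matches and every entry of the sum should be NaN. If the intent was to add the values position by position, one option is to drop the labels of one side with `to_numpy()`. The names `mismatched` and `positional` are made up here.

```python
import pandas as pd

first = pd.Series([10, 20, 30, 40])                               # labels 0..3
second = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])  # labels a..d

# No shared labels: the result has all eight labels and only NaN values.
mismatched = first + second
print(mismatched.isna().all())

# Adding a plain NumPy array skips alignment and works by position,
# keeping the labels of `second`.
positional = second + first.to_numpy()
print(positional)
```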

In [70]:
first=pd.Series([1,2,3,4], index=['a','b','c','d'])
second=pd.Series([10,20,30,40], index=['a','d', 'c', 'b'])

print(first+second)

a    11
b    42
c    33
d    24
dtype: int64


In [74]:
olympic_data={'Host_city':['London', 'Beijing', 'Athens','Sydney','Atlanta'],
              'Year':[2012,2008,2004,2000,1996],
              'No. of Participating Countries':[205,204,201,200,197]
             }

store_here=pd.DataFrame(olympic_data)
print(store_here)

  Host_city  Year  No. of Participating Countries
0    London  2012                             205
1   Beijing  2008                             204
2    Athens  2004                             201
3    Sydney  2000                             200
4   Atlanta  1996                             197
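A DataFrame ties back to what we covered above: each of its columns is a Series. The sketch below uses a shortened, made-up table `games` with three of the rows; `years` and `by_city` are illustrative names.

```python
import pandas as pd

games = pd.DataFrame({'Host_city': ['London', 'Beijing', 'Athens'],
                      'Year': [2012, 2008, 2004]})

# Selecting one column gives back a Series that shares the DataFrame's index.
years = games['Year']
print(type(years))

# Using a column as the index gives label-based lookup, as with any Series.
by_city = games.set_index('Host_city')['Year']
print(by_city['Beijing'])
```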
