# EDA Lab


## Pandas

* *pandas* is a Python library for data analysis. 
* It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

* *pandas* builds upon *numpy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

* The main data structures *pandas* provides are *Series* and *DataFrames*. 

Let's get started 

### Import Pandas and Numpy Libraries

In [44]:
import pandas as pd
import numpy as np

In [45]:
import sys
sys.path

['/Users/prakashkumarsingh/Desktop/sem4/eda',
 '/Library/Frameworks/Python.framework/Versions/3.10/lib/python310.zip',
 '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10',
 '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/lib-dynload',
 '',
 '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages']

### Introduction to pandas Data Structures

* *pandas* has two main data structures: *Series* and *DataFrames*. 

### pandas Series
- A *Series* is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 
- The axis labels are collectively referred to as the index. 
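
As a minimal sketch of these two points, the snippet below builds a small Series and inspects its values and its index separately. The labels `a`, `b`, `c` and the name `temps` are illustrative and do not come from the lab.

```python
import pandas as pd

# A Series pairs a one-dimensional array of values with an index of labels.
temps = pd.Series([21.5, 19.0, 23.2], index=["a", "b", "c"])

values = temps.to_numpy()    # the underlying NumPy array of values
labels = list(temps.index)   # the axis labels, collectively called the index

print(values)
print(labels)
print(temps["b"])            # look up a value by its label
```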

### Creating a Series by passing a list of values, letting pandas create a default integer index

In [46]:
data = np.array([1, 3, 5, np.nan, 6, 8])
s = pd.Series(data)
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [47]:
array = np.array([1, 3, 5, 6, 7])
p = pd.Series(array)

In [48]:
type(s)

pandas.core.series.Series

In [49]:
p.index
p

0    1
1    3
2    5
3    6
4    7
dtype: int64

In [50]:
print(p[3:5])

3    6
4    7
dtype: int64


### Creating a series with 'object' data type index

In [51]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], index=['tom', 'bob', 'nancy', 'dan', 'eric'])

In [52]:
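i = np.arange(1, 6, 1)
# Rebuild ser with integer labels 1 to 5; this replaces the string-labeled Series defined above
ser = pd.Series([100, 'foo', 300, 'bar', 500], index=i)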
print(ser)

1    100
2    foo
3    300
4    bar
5    500
dtype: object


In [53]:
ser.index

Int64Index([1, 2, 3, 4, 5], dtype='int64')

In [54]:
ser.loc[[1, 5]]                # label-location based indexer for selection by label

1    100
5    500
dtype: object

In [55]:
ser[[4, 3, 1]]

4    bar
3    300
1    100
dtype: object

In [56]:
ser.iloc[2]                    # integer-location based indexing for selection by position

300
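
Because `ser` carries integer labels that start at 1, a label and a position with the same number refer to different elements. The sketch below contrasts the two indexers on a small illustrative Series (`demo` is a hypothetical name, not part of the lab).

```python
import pandas as pd

# Integer labels start at 1, so labels and positions differ by one.
demo = pd.Series(["w", "x", "y", "z"], index=[1, 2, 3, 4])

by_label = demo.loc[2]       # label 2 -> "x"
by_position = demo.iloc[2]   # position 2 (third element) -> "y"

# Label slices include the end label; positional slices exclude the end position.
label_slice = demo.loc[1:3]      # labels 1, 2, 3
position_slice = demo.iloc[1:3]  # positions 1 and 2

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```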

In [57]:
'bob' in ser

False
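
The result is `False` because `in` on a Series tests the index labels, and `ser` now has the integer labels 1 to 5. A minimal sketch of that behavior, using a hypothetical Series named `people`:

```python
import pandas as pd

people = pd.Series([100, "foo", 300], index=["tom", "bob", "nancy"])

label_found = "bob" in people       # True: "bob" is an index label
value_as_label = "foo" in people    # False: "foo" is a value, not a label

# To search the values instead, test them explicitly.
value_found = bool(people.isin(["foo"]).any())

print(label_found, value_as_label, value_found)
```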

In [58]:
ser

1    100
2    foo
3    300
4    bar
5    500
dtype: object

In [59]:
ser * 2

1       200
2    foofoo
3       600
4    barbar
5      1000
dtype: object
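
The output mixes doubled numbers with repeated strings because an object-dtype Series applies `*` to each element using that element's own Python semantics. The sketch below compares this with a purely numeric Series; the names `mixed` and `numeric` are illustrative.

```python
import pandas as pd

mixed = pd.Series([100, "foo", 300])    # object dtype: each element keeps its Python type
numeric = pd.Series([100, 200, 300])    # integer dtype: vectorized numeric arithmetic

doubled_mixed = mixed * 2       # int * 2 doubles, str * 2 repeats the string
doubled_numeric = numeric * 2

print(doubled_mixed.tolist())
print(doubled_numeric.tolist())
```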

In [60]:
ser.loc[[3, 5]] ** 2          # 'nancy' and 'eric' are no longer labels; ser now uses the integer labels 1 to 5

3     90000
5    250000
dtype: object

### pandas DataFrame
- *pandas DataFrame* is a 2-dimensional labeled data structure with columns of potentially different types. 
- You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

### Create DataFrame from dictionary of Python Series

In [None]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
print(d)

In [None]:
df = pd.DataFrame(d)
df

In [None]:
df.index

In [None]:
df.columns

In [None]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

In [None]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
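
The cells above are shown without output, so here is a self-contained sketch of what to expect: the two Series are aligned on their labels, and any label or column without data is filled with `NaN`. It reuses the same dictionary `d`; the name `subset` is illustrative.

```python
import pandas as pd

d = {
    "one": pd.Series([100.0, 200.0, 300.0], index=["apple", "ball", "clock"]),
    "two": pd.Series([111.0, 222.0, 333.0, 4444.0],
                     index=["apple", "ball", "cerill", "dancy"]),
}

df = pd.DataFrame(d)
# The row index is the union of both Series' labels;
# a label missing from one Series yields NaN in that column.
print(df)

subset = pd.DataFrame(d, index=["dancy", "ball", "apple"], columns=["two", "five"])
# "five" is not a key in d, so that column is created and filled with NaN.
print(subset)
```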

### Create DataFrame from list of Python dictionaries

In [None]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
data

In [None]:
pd.DataFrame(data)

In [None]:
pd.DataFrame(data, index=['orange', 'red'])

In [None]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])
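
As a sketch of what these calls produce: each dictionary becomes one row, and keys missing from a row are filled with `NaN`. The variable names `records`, `frame`, and `picked` are illustrative.

```python
import pandas as pd

records = [{"alex": 1, "joe": 2}, {"ema": 5, "dora": 10, "alice": 20}]

frame = pd.DataFrame(records, index=["orange", "red"])
# The columns are the union of all keys; a key absent from a row's dict
# becomes NaN, which turns the integer columns into floats.
print(frame)

picked = pd.DataFrame(records, columns=["joe", "dora", "alice"])
# Passing columns= keeps only the named keys, in the given order.
print(picked)
```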

### Basic DataFrame operations

In [None]:
df

In [None]:
df['one']

In [None]:
df['three'] = df['one'] * df['two']
df

In [None]:
df['flag'] = df['one'] > 250
df

In [None]:
three = df.pop('three')

In [None]:
three

In [None]:
df

In [None]:
del df['two']

In [None]:
df

In [None]:
df.insert(2, 'copy_of_one', df['one'])
df

In [None]:
df['one_upper_half'] = df['one'][:2]
df
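
Assigning a shorter Series to a column works because pandas aligns on index labels, not on position or length. A minimal sketch with a hypothetical three-row frame (it uses `.iloc[:2]`, which selects the same first two rows as the `[:2]` slice above):

```python
import pandas as pd

frame = pd.DataFrame({"one": [100.0, 200.0, 300.0]},
                     index=["apple", "ball", "clock"])

# The right-hand side holds only the rows "apple" and "ball".
# Assignment aligns on index labels, so "clock" has no match and gets NaN.
frame["one_upper_half"] = frame["one"].iloc[:2]

print(frame)
```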

### Example for creating DataFrame from list 

In [None]:
columns = ['name', 'age', 'gender', 'job']

customer1 = pd.DataFrame([['mary', 19, "F", "student"],
                          ['akbar', 26, "M", "student"]],
                         columns=columns)

customer2 = pd.DataFrame([['amar', 22, "M", "student"],
                          ['alice', 58, "F", "manager"]],
                         columns=columns)

customer3 = pd.DataFrame(dict(name=['dinesh', 'julie'],
                              age=[33, 44], gender=['M', 'F'],
                              job=['engineer', 'scientist']))

customer3

#### Combining DataFrames using concat

In [None]:
# Combine DataFrames using concat (DataFrame.append was removed in pandas 2.0)
customers = pd.concat([customer1, customer2, customer3])
customers
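
By default `pd.concat` keeps each input frame's own row labels, so the combined frame can contain repeated labels. The sketch below, on two small illustrative frames, shows the `ignore_index=True` option, which renumbers the rows.

```python
import pandas as pd

left = pd.DataFrame({"name": ["mary", "akbar"], "age": [19, 26]})
right = pd.DataFrame({"name": ["amar", "alice"], "age": [22, 58]})

stacked = pd.concat([left, right])                        # keeps each frame's own labels
renumbered = pd.concat([left, right], ignore_index=True)  # fresh 0..n-1 index

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```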

#### Join DataFrames using merge

In [None]:
# A second DataFrame to join with customers
customer4 = pd.DataFrame(dict(name=['mary', 'akbar', 'amar', 'julie'],
                          height=[165, 180, 175, 171]))
customer4

In [None]:
# Use intersection of keys from both (i.e., customers and customer4) DataFrames

# method 1: with no `on` argument, merge joins on the columns common to both frames (here 'name')
merge_inter = customers.merge(customer4, how='inner')

# method 2
# merge_inter = pd.merge(customers, customer4, on='name', how='inner')

merge_inter

In [None]:
# Use union of keys from both DataFrames
customers = customers.merge(customer4, how='outer')
customers
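
To see the difference between the two join types on a small scale, the sketch below merges two illustrative frames and uses the `indicator=True` option, which adds a `_merge` column recording where each row came from. The frames `people` and `heights` are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({"name": ["mary", "alice", "julie"], "age": [19, 58, 44]})
heights = pd.DataFrame({"name": ["mary", "julie", "amar"], "height": [165, 171, 175]})

# Inner join: only names present in both frames.
inner = people.merge(heights, on="name", how="inner")

# Outer join: every name from either frame; missing values become NaN.
outer = people.merge(heights, on="name", how="outer", indicator=True)

print(inner)
print(outer)
```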

#### Summarizing

In [None]:
# Summarizing
customers                    # display the DataFrame (pandas truncates the display of long frames)


In [None]:
type(customers)              # DataFrame


In [None]:
customers.head(5)        # print the first 5 rows


In [None]:
customers.tail(5)        # print the last 5 rows


In [None]:
customers.index    # "the index" (aka "the labels")


In [None]:
customers.columns  # column names (which is "an index")


In [None]:
customers.dtypes     # data types of each column


In [None]:
customers.shape # number of rows and columns


In [None]:
customers.values     # underlying numpy array


In [None]:
customers.info()             # concise summary 


#### Columns selection

In [None]:
# Columns selection
customers['gender']                 # select one column
print(type(customers['gender']))    # Series
customers.gender                    # select the same column using attribute access


In [None]:
# select multiple columns
my_cols = ['age', 'gender']         # create a list of column names (age and gender)...
print(customers[my_cols])           # ...and use that list to select columns
print(type(customers[my_cols]))     # DataFrame
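
The type of the result depends on what goes inside the brackets, as this sketch on a small illustrative frame shows.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["mary", "akbar"],
                      "age": [19, 26],
                      "gender": ["F", "M"]})

as_series = frame["age"]     # a single label returns a Series
as_frame = frame[["age"]]    # a list of labels returns a DataFrame, even with one column

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Attribute access such as `frame.age` works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so bracket selection is the safer default.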


#### Rows selection

In [None]:
# iloc is strictly integer-position based
df = customers.copy()
df.iloc[0]                            # first row
df.iloc[0, 0]                         # first item of first row
df.iloc[0, 0] = 55                    # overwrite that item (the 'name' cell of the first row)

for i in range(df.shape[0]):
    row = df.iloc[i]                  # row is a copy of the i-th row
    row.age *= 100                    # changes the copy, not the original frame

df                                    # the age column of df is unchanged

In [None]:
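# loc is label based (the older mixed-access ix indexer has been removed from pandas)
df = customers.copy()
df.loc[0]         # row labeled 0
df.loc[0, "age"]  # age value in the row labeled 0
df.loc[0, "age"] = 55

for i in range(df.shape[0]):          # works because the merged frame has the default 0..n-1 index
    df.loc[i, "age"] *= 10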

df         # df is modified
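
The row-by-row loop above illustrates how `loc` writes back to the frame. For this kind of update, a single column-wise operation is usually shorter and faster. A minimal sketch on an illustrative frame, checking that both approaches agree:

```python
import pandas as pd

frame = pd.DataFrame({"name": ["mary", "akbar", "amar"], "age": [19, 26, 22]})

# Row-by-row update through loc, as in the cell above.
looped = frame.copy()
for label in looped.index:
    looped.loc[label, "age"] *= 10

# Equivalent vectorized update on the whole column.
vectorized = frame.copy()
vectorized["age"] = vectorized["age"] * 10

print(vectorized["age"].tolist())
print(vectorized.equals(looped))
```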

#### Row selection with simple logical filtering

In [None]:
customers[customers['age'] < 20]['name']  # names of users with age < 20
# customers[customers.age < 20]           # all columns for the same rows

#### Row selection with advanced logical filtering

In [None]:
customers[customers['age'] < 20][['age', 'job']]  # select multiple columns, i.e., age and job, for rows with age < 20


In [None]:
customers[(customers.age > 20) & (customers.gender == 'M')]   # use multiple conditions


In [None]:
customers[customers.job.isin(['student','engineer'])]  # filter specific values i.e., job as student or engineer
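
The filters above all rely on boolean masks. The sketch below, on a small illustrative frame, shows how masks combine with `&`, `|`, and `~`, and how `.loc` can apply a row mask and a column selection in one step.

```python
import pandas as pd

frame = pd.DataFrame({
    "name": ["mary", "akbar", "alice", "dinesh"],
    "age": [19, 26, 58, 33],
    "gender": ["F", "M", "F", "M"],
    "job": ["student", "student", "manager", "engineer"],
})

# Each comparison yields a boolean Series. Wrap each one in parentheses,
# because & (and), | (or) and ~ (not) bind more tightly than comparisons.
older_men = frame[(frame["age"] > 20) & (frame["gender"] == "M")]
young_or_manager = frame[(frame["age"] < 20) | (frame["job"] == "manager")]
not_students = frame[~frame["job"].isin(["student"])]

# .loc takes the row mask and the column selection together.
young_names = frame.loc[frame["age"] < 20, "name"]

print(older_men["name"].tolist())
print(young_names.tolist())
```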

In [None]:
df

#### Descriptive statistics

In [None]:
df.describe()                # summarize all numeric columns

In [None]:
df.describe(include='all')    # summarize all columns

In [None]:
df.describe(include=['object'])         # limit to object (string) columns

#### Statistics per group (groupby)

In [None]:
print(df.groupby("job").mean(numeric_only=True))  # mean of the numeric columns only; 'name' and 'gender' are skipped

In [None]:
print(df.groupby("job").age.mean())  # groupby job with age statistics

#df.groupby("job").age.mean()

#method2
#print(df.groupby("job")['age'].mean())
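
When more than one statistic per group is needed, `agg` accepts a list of function names. A minimal sketch on an illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({
    "name": ["mary", "akbar", "alice", "dinesh"],
    "age": [19, 26, 58, 33],
    "job": ["student", "student", "manager", "engineer"],
})

# Several statistics of one column, computed per group.
age_stats = frame.groupby("job")["age"].agg(["count", "mean", "max"])

print(age_stats)
```

Each requested statistic becomes a column of the result, and each group becomes a row.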

## References on pandas for more details
* *pandas* Documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

