# Pandas
We will learn the basics of Pandas, a powerful open-source data analysis and manipulation library for Python that is built on top of NumPy. The name **Pandas** is derived from the term **panel data**.

In this session, we will study Pandas Series and DataFrames and some basic operations on them.

To get started using `pandas`, import it into your Python program as follows:

In [1]:
import pandas as pd

# Series
The Pandas `Series` object is a 1D labeled array capable of holding data of any type.
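As a rough sketch of what "data of any type" means in practice, the snippet below builds a few small Series and inspects the dtype that pandas infers for each. The variable names are illustrative only and are not part of the session's data.

```python
import pandas as pd

# Integers and floats are stored with numeric dtypes.
int_series = pd.Series([10, 20, 30])
float_series = pd.Series([1.5, 2.5, 3.5])

# A mix of unrelated Python objects falls back to the generic object dtype.
mixed_series = pd.Series([1, 'two', 3.0])

print(int_series.dtype)    # int64
print(float_series.dtype)  # float64
print(mixed_series.dtype)  # object
```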

### Example
Creating series

In [2]:
empty_series = pd.Series(data = None, dtype = 'int')

In [3]:
empty_series

Series([], dtype: int64)

In [4]:
type(empty_series)

pandas.core.series.Series

In [5]:
print(empty_series)

Series([], dtype: int64)


In [6]:
production_data = [1000, 1200, 1100, 950, 1050, 1430, 1760]
production_day = ['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6', 'Day7']

In [7]:
daily_production = pd.Series(data = production_data, name = 'Production_Details')
daily_production

0    1000
1    1200
2    1100
3     950
4    1050
5    1430
6    1760
Name: Production_Details, dtype: int64

In [8]:
daily_production = pd.Series(data = production_data, index=production_day,name = 'Production_Details')
daily_production

Day1    1000
Day2    1200
Day3    1100
Day4     950
Day5    1050
Day6    1430
Day7    1760
Name: Production_Details, dtype: int64

In [9]:
daily_production.index

Index(['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6', 'Day7'], dtype='object')

In [10]:
daily_production.values

array([1000, 1200, 1100,  950, 1050, 1430, 1760])

In [11]:
daily_production.dtype

dtype('int64')

In [12]:
daily_production.shape

(7,)

In [13]:
daily_production.ndim

1

In [14]:
daily_production.size

7

In [16]:
monthly_expenses = {'Jan': 2567.65,
                    'Feb': 2809.44,
                    'Mar': 3001.6,
                    'Apr': 2787.6,
                    'May': 3203.4}
expenses_series = pd.Series(data = monthly_expenses)
expenses_series

Jan    2567.65
Feb    2809.44
Mar    3001.60
Apr    2787.60
May    3203.40
dtype: float64

In [17]:
expenses_series_reseted=expenses_series.reset_index()
expenses_series_reseted
# The old index becomes a regular column of the resulting DataFrame, which gets a new default integer index

   index        0
0    Jan  2567.65
1    Feb  2809.44
2    Mar  3001.60
3    Apr  2787.60
4    May  3203.40
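The default column labels `index` and `0` are not very descriptive. A minimal sketch, assuming a shortened copy of the expenses data, of how `rename_axis` and the `name` argument of `reset_index` can label the result, and how `drop=True` discards the old labels instead of turning them into a column:

```python
import pandas as pd

expenses = pd.Series({'Jan': 2567.65, 'Feb': 2809.44, 'Mar': 3001.6})

# Name the index and the values column so the resulting DataFrame is easier to read.
named = expenses.rename_axis('Month').reset_index(name='Expense')
print(named.columns.tolist())      # ['Month', 'Expense']

# drop=True throws the old labels away and returns a Series with a default index.
renumbered = expenses.reset_index(drop=True)
print(type(renumbered).__name__)   # Series
print(renumbered.index.tolist())   # [0, 1, 2]
```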


In [18]:
expenses_series_orig=expenses_series_reseted.set_index("index")
expenses_series_orig
# Reverses reset_index: the "index" column becomes the index again (the result is still a DataFrame, not a Series)

             0
index         
Jan    2567.65
Feb    2809.44
Mar    3001.60
Apr    2787.60
May    3203.40
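Note that `set_index` returns a one-column DataFrame here, not the original Series. One possible way to get a Series back, sketched on a shortened copy of the data, is to select the single remaining column (labeled `0`) and clear the leftover names:

```python
import pandas as pd

expenses = pd.Series({'Jan': 2567.65, 'Feb': 2809.44, 'Mar': 3001.6})

frame = expenses.reset_index()          # DataFrame with columns 'index' and 0
restored = frame.set_index('index')[0]  # selecting column 0 gives a Series again

# Clear the leftover names so the result looks like the original Series.
restored.name = None
restored.index.name = None

print(type(restored).__name__)    # Series
print(restored.equals(expenses))  # True
```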


In [19]:
expenses_series_reseted.index=["A","B","C","D","E"]

In [20]:
expenses_series_reseted

   index        0
A    Jan  2567.65
B    Feb  2809.44
C    Mar  3001.60
D    Apr  2787.60
E    May  3203.40


In [21]:
import numpy as np

In [22]:
employee_names = np.array(['Aarti', 'David', 'Mahmood', 'Pranjal'])
employee_ID =  np.array([7634, 9008, 10115, 16643])

In [23]:
employee_series = pd.Series(data=employee_names, index=employee_ID)
employee_series

7634       Aarti
9008       David
10115    Mahmood
16643    Pranjal
dtype: object
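Because this Series uses integer employee IDs as labels, it helps to be explicit about whether a lookup is by label or by position. A small sketch using the same employee data: `.loc` looks up by label and `.iloc` by position.

```python
import numpy as np
import pandas as pd

employee_series = pd.Series(
    data=np.array(['Aarti', 'David', 'Mahmood', 'Pranjal']),
    index=np.array([7634, 9008, 10115, 16643]),
)

print(employee_series.loc[9008])   # David  (label-based lookup)
print(employee_series.iloc[0])     # Aarti  (position-based lookup)
print(9008 in employee_series)     # True   (membership checks the labels)
```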

In [24]:
daily_production

Day1    1000
Day2    1200
Day3    1100
Day4     950
Day5    1050
Day6    1430
Day7    1760
Name: Production_Details, dtype: int64

In [25]:
daily_production[daily_production<1250]

Day1    1000
Day2    1200
Day3    1100
Day4     950
Day5    1050
Name: Production_Details, dtype: int64

In [26]:
daily_production[daily_production>1250]

Day6    1430
Day7    1760
Name: Production_Details, dtype: int64

In [27]:
daily_production[(daily_production > 1000) & (daily_production < 1500)]

Day2    1200
Day3    1100
Day5    1050
Day6    1430
Name: Production_Details, dtype: int64

In [28]:
daily_production[(daily_production > 1000) | (daily_production < 1500)]

Day1    1000
Day2    1200
Day3    1100
Day4     950
Day5    1050
Day6    1430
Day7    1760
Name: Production_Details, dtype: int64

In [29]:
len(daily_production[(daily_production > 1000) | (daily_production < 1500)])

7
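The `|` filter returns all seven rows because every value is either above 1000 or below 1500. A minimal sketch, recreating `daily_production`, that makes the underlying boolean masks explicit; the names `in_range` and `either` are illustrative:

```python
import pandas as pd

daily_production = pd.Series(
    [1000, 1200, 1100, 950, 1050, 1430, 1760],
    index=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6', 'Day7'],
    name='Production_Details',
)

# Each comparison produces a boolean Series; & and | combine them element-wise.
in_range = (daily_production > 1000) & (daily_production < 1500)
either = (daily_production > 1000) | (daily_production < 1500)

print(in_range.sum())   # 4, True counts as 1
print(either.all())     # True, so the | filter keeps every row

# ~ negates a mask: rows that are NOT strictly between 1000 and 1500.
print(daily_production[~in_range].index.tolist())  # ['Day1', 'Day4', 'Day7']
```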

In [32]:
daily_production.max()

1760

In [33]:
daily_production.min()

950

In [34]:
daily_production.mean()

1212.857142857143

In [35]:
daily_production.median()

1100.0

In [36]:
daily_production.sum()

8490
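The individual statistics above can also be collected in one call. A short sketch, assuming the same `daily_production` values, using `Series.agg` with a list of function names:

```python
import pandas as pd

daily_production = pd.Series(
    [1000, 1200, 1100, 950, 1050, 1430, 1760],
    index=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6', 'Day7'],
    name='Production_Details',
)

# agg returns a new Series indexed by the names of the requested functions.
stats = daily_production.agg(['min', 'max', 'mean', 'median', 'sum'])
print(stats)
```

`daily_production.describe()` is a related shortcut that reports the count, mean, standard deviation, and quartiles.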

In [37]:
daily_production.sort_values()
# sort_values sorts in ascending order by default

Day4     950
Day1    1000
Day5    1050
Day3    1100
Day2    1200
Day6    1430
Day7    1760
Name: Production_Details, dtype: int64

In [38]:
daily_production.sort_values(ascending = False)

Day7    1760
Day6    1430
Day2    1200
Day3    1100
Day5    1050
Day1    1000
Day4     950
Name: Production_Details, dtype: int64
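When only the top or bottom few values are needed, a full sort is not required. A small sketch, again recreating `daily_production`, using `nlargest` and `nsmallest`:

```python
import pandas as pd

daily_production = pd.Series(
    [1000, 1200, 1100, 950, 1050, 1430, 1760],
    index=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6', 'Day7'],
    name='Production_Details',
)

top_three = daily_production.nlargest(3)     # three highest-production days
bottom_two = daily_production.nsmallest(2)   # two lowest-production days

print(top_three.index.tolist())    # ['Day7', 'Day6', 'Day2']
print(bottom_two.index.tolist())   # ['Day4', 'Day1']
```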

In [39]:
series_1 = pd.Series(data = [287, 343, 187], index = ['Asif', 'Bhairavi', 'Chad'], name = 'Monday')
series_2 = pd.Series(data = [186, 524, 202], index = ['Asif', 'Bhairavi', 'Chad'], name = 'Tuesday')

In [40]:
series_1

Asif        287
Bhairavi    343
Chad        187
Name: Monday, dtype: int64

In [41]:
series_2

Asif        186
Bhairavi    524
Chad        202
Name: Tuesday, dtype: int64

In [42]:
series_1 + series_2

Asif        473
Bhairavi    867
Chad        389
dtype: int64

In [43]:
series_1.add(series_2)

Asif        473
Bhairavi    867
Chad        389
dtype: int64
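The `+` operator and `.add` give the same result here because both Series have identical labels. They differ once labels are missing on one side: values are aligned by index label, and `.add` accepts a `fill_value`. A minimal sketch, assuming a hypothetical Tuesday Series with no entry for Chad:

```python
import pandas as pd

monday = pd.Series([287, 343, 187], index=['Asif', 'Bhairavi', 'Chad'], name='Monday')
# Hypothetical data: Chad has no entry for Tuesday.
tuesday = pd.Series([186, 524], index=['Asif', 'Bhairavi'], name='Tuesday')

plain = monday + tuesday                     # Chad has no match, so the result is NaN
filled = monday.add(tuesday, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```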

### Quiz
Consider the series shown below:
```
cust_names = ['Chad', 'Farheen', 'Himadri', 'Monisha']
cust_bill = [256.78, 434.53, 109.25, 529.42]
cust_info = pd.Series(cust_bill, cust_names)
```
Use Series operations to find out which customer has spent the most money.

In [48]:
cust_names = ['Chad', 'Farheen', 'Himadri', 'Monisha']
cust_bill = [256.78, 434.53, 109.25, 529.42]
cust_info = pd.Series(cust_bill, cust_names)
cust_info.idxmax()

'Monisha'

In [49]:
cust_info.idxmin()

'Himadri'

# Dataframes
A `DataFrame` is a 2D labeled table in which each column is a `Series` and all columns share the same index. We will study DataFrames in detail. Learners are expected to practice with Series on their own later, since most of the methods and operations available on DataFrames also apply to Series.
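To make the relationship between the two objects concrete, here is a small sketch that combines two named Series into a DataFrame with `pd.concat` and then pulls one column back out as a Series. The variable name `sales` is illustrative.

```python
import pandas as pd

monday = pd.Series([287, 343, 187], index=['Asif', 'Bhairavi', 'Chad'], name='Monday')
tuesday = pd.Series([186, 524, 202], index=['Asif', 'Bhairavi', 'Chad'], name='Tuesday')

# axis=1 places each Series side by side as a column; the Series names become column labels.
sales = pd.concat([monday, tuesday], axis=1)
print(sales.shape)                       # (3, 2)
print(sales.columns.tolist())            # ['Monday', 'Tuesday']

# Selecting a single column returns a Series again.
print(type(sales['Monday']).__name__)    # Series
```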

### Example
Creating dataframes

In [50]:
df = pd.DataFrame()

In [51]:
df

In [52]:
type(df)

pandas.core.frame.DataFrame

In [53]:
cust_data = [['001', 'Alice', 'alice@example.com', '123-456-7890', 20],
             ['002', 'Bhooshan', 'bob@example.com', '987-654-3210', 15],
             ['003', 'Carl', 'charlie@example.com', '555-123-4567', 25]]

cust_cols = ['CustomerID', 'Name', 'Email', 'Phone', 'TotalPurchases']

df = pd.DataFrame(data = cust_data, columns = cust_cols)

In [54]:
df

   CustomerID      Name                Email         Phone  TotalPurchases
0         001     Alice    alice@example.com  123-456-7890              20
1         002  Bhooshan      bob@example.com  987-654-3210              15
2         003      Carl  charlie@example.com  555-123-4567              25


In [55]:
df.shape

(3, 5)

In [56]:
df.ndim

2

In [57]:
cust_data = [['001', 'Alice', 'alice@example.com', '123-456-7890', 20],
             ['002', 'Bhooshan', 'bob@example.com', '987-654-3210', 15],
             ['003', 'Carl', 'charlie@example.com', '555-123-4567', 25]]

cust_cols = ['CustomerID', 'Name', 'Email', 'Phone', 'TotalPurchases']

cust_ind = ['A', 'B', 'C']

df = pd.DataFrame(data = cust_data, columns = cust_cols, index = cust_ind)

In [58]:
df

   CustomerID      Name                Email         Phone  TotalPurchases
A         001     Alice    alice@example.com  123-456-7890              20
B         002  Bhooshan      bob@example.com  987-654-3210              15
C         003      Carl  charlie@example.com  555-123-4567              25


In [59]:
cust_data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
             'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com', 'eva@example.com', 'frank@example.com'],
             'Phone':  ['123-456-7890', '987-654-3210', '555-123-4567', '111-222-3333', '444-555-6666', '777-888-9999'],
             'TotalPurchases':  [20, 15, 25, 10, 30, 22]}

cust_ind = ['001', '002', '003', '004', '005', '006']

df = pd.DataFrame(data = cust_data, index = cust_ind)

In [60]:
df

        Name                Email         Phone  TotalPurchases
001    Alice    alice@example.com  123-456-7890              20
002      Bob      bob@example.com  987-654-3210              15
003  Charlie  charlie@example.com  555-123-4567              25
004    David    david@example.com  111-222-3333              10
005      Eva      eva@example.com  444-555-6666              30
006    Frank    frank@example.com  777-888-9999              22


### Example
Reading data into dataframes

In [61]:
df = pd.read_csv('Buffet_Details.csv')

In [62]:
df

   Room      Name  Age   Cuisine  Expenditure
0    A1    Shilpa   18    Indian        24.65
1    A2  Jaspreet   32  American        18.54
2    A3   Dominic   19    Indian        27.66
3    A4     Ahmad   22  American        19.54
4    A5    Joseph   28    Indian        17.32
5    A6      Saju   31    Indian        12.56
6    A7    Monica   48   Chinese        11.09
7    A8    Preeti   67  American        12.23
8    A9      Emma   24    Indian        18.88
9   A10    Gaurav   29   Chinese        14.43
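`Buffet_Details.csv` itself is not reproduced in this session, so the sketch below reads an in-memory stand-in instead. The CSV text is an assumption: it copies the first three rows of the output above and assumes the file has a plain header row with these five columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for Buffet_Details.csv (assumed layout, first three rows only).
csv_text = """Room,Name,Age,Cuisine,Expenditure
A1,Shilpa,18,Indian,24.65
A2,Jaspreet,32,American,18.54
A3,Dominic,19,Indian,27.66
"""

# read_csv accepts a file path or any file-like object.
buffet = pd.read_csv(io.StringIO(csv_text))
print(buffet.shape)    # (3, 5)
print(buffet.dtypes)   # column types are inferred from the text

# index_col uses one of the columns as the row labels instead of 0, 1, 2, ...
by_room = pd.read_csv(io.StringIO(csv_text), index_col='Room')
print(by_room.loc['A2', 'Name'])   # Jaspreet
```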
