## Introduction to Pandas
## Date: 25/1/22

Pandas has two main data structures: the Series and the DataFrame.
### Understanding Pandas Series

In [1]:
import numpy as np
import pandas as pd

In [4]:
#series data structure
age = pd.Series([20, 23, 43, 21, 27, 25, 26])
age

0    20
1    23
2    43
3    21
4    27
5    25
6    26
dtype: int64

In [35]:
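# Boolean (mask) operations work on a Series just as on a NumPy array (plain Python lists do not support them).
# Only the last expression in a cell is displayed. This cell was run after the index was renamed (In [26]), hence the name labels in the output.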
age < 25
age[age<25]
age[age < age.mean()]

chris        20
sam          23
zelda        21
ghosling     25
hemsworth    26
dtype: int64
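
To make the masking step explicit, here is a minimal sketch that rebuilds the `age` Series with its default integer index and stores each intermediate result. The variable names `mask`, `young` and `mid_twenties` are illustrative and not part of the original notebook.

```python
import pandas as pd

age = pd.Series([20, 23, 43, 21, 27, 25, 26])

# A comparison returns a boolean Series with the same index as `age`.
mask = age < 25
print(mask.tolist())          # [True, True, False, True, False, False, False]

# Indexing with the mask keeps only the entries where it is True.
young = age[mask]
print(young.tolist())         # [20, 23, 21]

# Masks combine with & (and), | (or) and ~ (not);
# each comparison needs its own parentheses.
mid_twenties = age[(age >= 23) & (age <= 26)]
print(mid_twenties.tolist())  # [23, 25, 26]
```

The Python keywords `and` / `or` do not work element-wise on a Series, which is why the bitwise operators are used here.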

In [20]:
#Exploring the pandas series object
print(type(age))
print(age.dtype)
print(age.values)
print(age.mean())

<class 'pandas.core.series.Series'>
int64
[20 23 43 21 27 25 26]
26.428571428571427


### Difference between a NumPy array and a pandas Series:
- The essential difference is the index: a NumPy array has an implicitly defined integer index (positions 0, 1, 2, ...) used to access its values, whereas a pandas Series has an explicitly defined index associated with its values, and that index can hold labels such as strings.
- A Series also differs from a Python list in that all of its elements share a single dtype, whereas a list can hold elements of different types. (A Series built from mixed values falls back to the generic `object` dtype.)

In [14]:
# numpy array
a = np.array(range(0, 5))
print('This is a numpy array ',a)
# pandas series
b = pd.Series(range(0, 5))
print('This is a Pandas series\n', b)

This is a numpy array  [0 1 2 3 4]
This is a Pandas series
 0    0
1    1
2    2
3    3
4    4
dtype: int64
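
The two differences listed above can be sketched side by side. The labels `'a'`, `'b'`, `'c'` and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

values = [10, 20, 30]

arr = np.array(values)                           # positions 0, 1, 2 are implicit
ser = pd.Series(values, index=['a', 'b', 'c'])   # labels are stored explicitly

print(arr[1])        # access by position
print(ser['b'])      # access by label
print(ser.iloc[1])   # positional access is still available through iloc

# A list may mix types freely; a Series stores everything under one dtype.
mixed = pd.Series([1, 2.5, 3])
print(mixed.dtype)   # float64: the integers are upcast to float
```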


With a pandas Series we can replace the default integer index with custom labels, for instance:

In [26]:
age.index = ['chris','sam', 'leah', 'zelda', 'ryan', 'ghosling', 'hemsworth']
print(age)

chris        20
sam          23
leah         43
zelda        21
ryan         27
ghosling     25
hemsworth    26
dtype: int64


In [31]:
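#indexing by a single label
age['chris']
#selecting several labels with a list
age[['chris', 'zelda']]
#using iloc for position-based slicing (the end position is excluded)
age.iloc[0:3]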

chris    20
sam      23
leah     43
dtype: int64

In [33]:
#label-based slicing on a pandas series includes the end label. Note that
# this is not the case with position-based slicing (iloc), python lists or numpy arrays
age['chris': 'ryan']
#here ryan is also included

chris    20
sam      23
leah     43
zelda    21
ryan     27
dtype: int64
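
As a small sketch of how label slices and position slices treat their end point, the snippet below rebuilds the labelled `age` Series and slices it both ways. The names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

age = pd.Series(
    [20, 23, 43, 21, 27, 25, 26],
    index=['chris', 'sam', 'leah', 'zelda', 'ryan', 'ghosling', 'hemsworth'],
)

by_label = age.loc['chris':'ryan']   # label slice: 'ryan' is included
by_position = age.iloc[0:4]          # position slice: position 4 is excluded

print(list(by_label.index))      # ['chris', 'sam', 'leah', 'zelda', 'ryan']
print(list(by_position.index))   # ['chris', 'sam', 'leah', 'zelda']
```

Writing `.loc` or `.iloc` explicitly makes it clear to the reader which of the two rules applies.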

## Some exercises
## Date: 25/1/22

In [36]:
# Order (sort) the given pandas Series
X = pd.Series([4,2,5,1,3],
              index=['fourth','second','fifth','first','third'])
X = X.sort_values()
print(X)

first     1
second    2
third     3
fourth    4
fifth     5
dtype: int64
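
Two related variations on this exercise, sketched on a Series like the one above: sorting in descending order and sorting by label instead of by value. The names `descending` and `by_label` are illustrative.

```python
import pandas as pd

X = pd.Series([4, 2, 5, 1, 3],
              index=['fourth', 'second', 'fifth', 'first', 'third'])

descending = X.sort_values(ascending=False)   # largest value first
by_label = X.sort_index()                     # alphabetical order of the labels

print(descending.tolist())     # [5, 4, 3, 2, 1]
print(list(by_label.index))    # ['fifth', 'first', 'fourth', 'second', 'third']
print(X.tolist())              # [4, 2, 5, 1, 3]: X itself is unchanged
```

Both methods return a new Series, which is why the exercise assigns the result back to `X`.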


## Date: 6th Feb, 2022

In [3]:
#load sales_data.csv into a DataFrame
df = pd.read_csv("sales_data.csv")
df

,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
0,2013-11-26,26,November,2013,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
1,2015-11-26,26,November,2015,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2,2014-03-23,23,March,2014,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401
3,2016-03-23,23,March,2016,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088
4,2014-05-15,15,May,2014,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113031,2016-04-12,12,April,2016,41,Adults (35-64),M,United Kingdom,England,Clothing,Vests,"Classic Vest, S",3,24,64,112,72,184
113032,2014-04-02,2,April,2014,18,Youth (<25),M,Australia,Queensland,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113033,2016-04-02,2,April,2016,18,Youth (<25),M,Australia,Queensland,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113034,2014-03-04,4,March,2014,37,Adults (35-64),F,France,Seine (Paris),Clothing,Vests,"Classic Vest, L",24,24,64,684,576,1260


In [7]:
#conditional selection (boolean arrays)
df.loc[df['Revenue'] < 400, 'Country']

6              Australia
7              Australia
16                Canada
17                Canada
18                Canada
               ...      
113015     United States
113028    United Kingdom
113029    United Kingdom
113030    United Kingdom
113031    United Kingdom
Name: Country, Length: 65445, dtype: object

In [9]:
#get multiple columns
df.loc[df['Revenue'] < 400, ['Country', 'State', 'Revenue']]

,Country,State,Revenue
6,Australia,Victoria,379
7,Australia,Victoria,190
16,Canada,British Columbia,238
17,Canada,British Columbia,119
18,Canada,British Columbia,119
...,...,...,...
113015,United States,Washington,300
113028,United Kingdom,England,123
113029,United Kingdom,England,123
113030,United Kingdom,England,369
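
Conditions can also be combined before they are passed to `loc`. The sketch below does not read `sales_data.csv`; it uses a small hand-built frame, `df_small`, holding a few of the Country/State/Revenue values shown in the outputs above. The frame and the mask names are assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in for the sales data (five rows copied from the outputs above).
df_small = pd.DataFrame({
    'Country': ['Canada', 'Australia', 'Australia', 'Australia', 'Canada'],
    'State': ['British Columbia', 'New South Wales', 'New South Wales',
              'Victoria', 'British Columbia'],
    'Revenue': [950, 2401, 418, 379, 238],
})

low_revenue = df_small['Revenue'] < 400            # boolean Series, one value per row
in_australia = df_small['Country'] == 'Australia'

# First argument of loc selects rows, second selects columns.
selected = df_small.loc[low_revenue & in_australia, ['State', 'Revenue']]
print(selected)
```

Only the Victoria row satisfies both conditions, so `selected` holds a single row.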


In [10]:
#drop columns: axis=1 refers to columns (axis=0 would drop rows)
df.drop(['State'], axis = 1)
#drop returns a new DataFrame and leaves df unchanged (most pandas operations work this way), therefore if you check df, the column State still exists

,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
0,2013-11-26,26,November,2013,19,Youth (<25),M,Canada,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
1,2015-11-26,26,November,2015,19,Youth (<25),M,Canada,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2,2014-03-23,23,March,2014,49,Adults (35-64),M,Australia,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401
3,2016-03-23,23,March,2016,49,Adults (35-64),M,Australia,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088
4,2014-05-15,15,May,2014,47,Adults (35-64),F,Australia,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113031,2016-04-12,12,April,2016,41,Adults (35-64),M,United Kingdom,Clothing,Vests,"Classic Vest, S",3,24,64,112,72,184
113032,2014-04-02,2,April,2014,18,Youth (<25),M,Australia,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113033,2016-04-02,2,April,2016,18,Youth (<25),M,Australia,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113034,2014-03-04,4,March,2014,37,Adults (35-64),F,France,Clothing,Vests,"Classic Vest, L",24,24,64,684,576,1260
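
A minimal sketch of how to make a drop persist, again on a small hand-built frame rather than the real sales data (`df_small` and `dropped` are illustrative names):

```python
import pandas as pd

df_small = pd.DataFrame({
    'Country': ['Canada', 'Australia'],
    'State': ['British Columbia', 'New South Wales'],
    'Revenue': [950, 2401],
})

dropped = df_small.drop(columns=['State'])   # same effect as drop(['State'], axis=1)

print(list(df_small.columns))   # ['Country', 'State', 'Revenue']: original untouched
print(list(dropped.columns))    # ['Country', 'Revenue']

# To keep the change, assign the result back to the same name.
df_small = df_small.drop(columns=['State'])
print(list(df_small.columns))   # ['Country', 'Revenue']
```

Reassigning is usually preferred over `inplace=True` because it keeps each step explicit and chainable.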
