# Readme

• These are my notes on Pandas, written while I was working on a crypto trading project: https://github.com/abhishek4563/crypto-trading-app

• I created this notebook to practise what I learnt from tutorials and videos on Pandas. I kept hitting blockers in the crypto algo project because of gaps in my Pandas skills, so I decided to upskill in Pandas before moving forward.

• This Jupyter notebook corresponds to my OneNote section under crypto trading project/Pandas. The OneNote section holds several notes, concept links and code. Use this notebook together with those OneNote notes for proper understanding.
	

# Resources
	• An excellent course by freeCodeCamp on data analysis in Python. I skimmed the Pandas and NumPy sections, which explain both topics very well; I might do the whole course later for practice
	• A nice YouTube tutorial that covers the basics of Pandas and creating dataframes in good depth
	• A nice tutorial with a good level of explanation of common Pandas tasks
	• A good tutorial by DataCamp
	• An extensive tutorial by GeeksforGeeks, but better used as a reference!

In [2]:
import numpy as np 
import pandas as pd
import json 
import matplotlib.pyplot as plt
import seaborn as sns
import random
import datetime as dt

# NumPy Basics

### Creating a Numpy Array 

In [3]:
np.array([1,2,3,4])

array([1, 2, 3, 4])

In [4]:
np.array(['1','2','3','4']) # NumPy can technically store strings, but it is really made for numbers and numerical calculations

array(['1', '2', '3', '4'], dtype='<U1')

In [5]:
a=np.array([1,2,3,4])

In [6]:
a

array([1, 2, 3, 4])

### Basic Numpy Operations

In [7]:
a[1] # the usual list-style indexing works. Note that this returns the value itself, not an array (see below)

2

In [8]:
a[1:3] # note that a slice returns a NumPy array

array([2, 3])

In [9]:
a[:-1]

array([1, 2, 3])

In [10]:
a[1],a[2],a[0]

(2, 3, 1)

In [11]:
a[[1,2,0]] # we can also pass a list of the indexes we want to grab. This RETURNS an array.

array([2, 3, 1])

In [12]:
a[[1]] # returns an array even if the list holds a single index

array([2])

### Data types

In [13]:
a.dtype # NumPy assigns a data type based on the data we put in;
            # we can also specify it explicitly in code to be more memory efficient


dtype('int64')
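
As a minimal sketch of specifying the data type explicitly (the variable names are made up for illustration, and the default integer type can differ by platform):

```python
import numpy as np

# Default integer dtype is chosen by NumPy (int64 on most 64-bit systems)
default_arr = np.array([1, 2, 3, 4])

# Ask for a smaller integer type explicitly; the values must fit in int8
small_arr = np.array([1, 2, 3, 4], dtype=np.int8)

# astype() returns a converted copy and leaves the original array untouched
as_float = small_arr.astype(np.float64)

print(default_arr.dtype, default_arr.itemsize)  # itemsize is bytes per element
print(small_arr.dtype, small_arr.itemsize)
print(as_float.dtype)
```

Smaller dtypes use fewer bytes per element, which only matters for large arrays; for small experiments the default is fine.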

### Multidimensional Numpy Arrays 

In [14]:
b = np.array ([[1,2,3], [4,5,6]])

In [15]:
b

array([[1, 2, 3],
       [4, 5, 6]])

In [16]:
print(b.dtype,
      b.shape,
      b.size)  # removed the unfinished trailing `b.` that caused a SyntaxError

int64 (2, 3) 6

In [19]:
b[0][1] # this would be a way of referring to a particular cell
        # Note that this is different from passing an array of indices 
        #that we want returned, that will have a different syntax ([[]])

2

In [20]:
b[0,1] # however, Numpy also allows us this shorthand for referring to elements, which is super handy when slicing

2

In [22]:
b[:1,:3] # each axis is sliced exactly like a Python list (although it's different for Pandas!)
      

array([[1, 2, 3]])
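
The different ways of indexing a 2-D array shown above can be compared side by side. This is only a sketch that reuses the same `b`; the result names are made up for illustration:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

cell_chained = b[0][1]   # index row 0 first, then element 1 of that row
cell_tuple = b[0, 1]     # same cell in a single indexing step
rows_fancy = b[[0, 1]]   # a list of row indices returns those rows as a 2-D array
col_slice = b[:, 1]      # every row, column 1 -> 1-D array
sub_block = b[:, 1:3]    # every row, columns 1 and 2 -> 2-D array

print(cell_chained, cell_tuple)
print(rows_fancy.shape, col_slice.shape, sub_block.shape)
```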

### Vectorization

In [23]:
a+10 # vectorisation means that this operation will be applied to all elements of the array 

array([11, 12, 13, 14])

In [24]:
a*10

array([10, 20, 30, 40])

### Boolean Arrays

Vectorization and Boolean Arrays
• Vectorisation: a+10, a*10, a+b, etc. The operation is applied to every element of the array
• Similarly, a>2 returns a boolean array of True and False values
• 3 ways of selection
	• a[1], a[2]
	• a[[1,2]]
	• a[[True,False,True,False]] (one boolean per element)
		○ This means we can also write a[a>2], which is a really powerful way of selecting elements
![image.png](attachment:image.png)


In [25]:
a>2

array([False, False,  True,  True])

In [26]:
a[a>2] # there are 3 ways of selection: individual items, an array of indices, or an array of booleans
        # vectorisation means that we can generate the boolean array with simple, intuitive syntax, as here
        # no need for extra [] as the vectorised operation already generates an array

array([3, 4])

In [27]:
a[(a>a.mean()) | (a<4)] # we can combine expressions with | and &, BUT NOT with the keywords `or` and `and`
                        # that we use in regular Python; those throw an error

array([1, 2, 3, 4])
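
A small sketch of combining boolean arrays, using the same `a` as above (the result names are only for illustration). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` and `<`:

```python
import numpy as np

a = np.array([1, 2, 3, 4])

both = a[(a > 1) & (a < 4)]    # element-wise AND
either = a[(a < 2) | (a > 3)]  # element-wise OR
negated = a[~(a > 2)]          # element-wise NOT

# The `and` keyword asks for a single True/False value for the whole array,
# which NumPy refuses to guess, so it raises a ValueError
try:
    a[(a > 1) and (a < 4)]
    raised = False
except ValueError:
    raised = True

print(both, either, negated, raised)
```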

In [28]:
a.T # transpose, dot product and other matrix operations are already implemented and are very fast
    # (transposing a 1-D array like this one returns it unchanged)

array([1, 2, 3, 4])
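
Since the transpose of a 1-D array looks identical to the original, a 2-D array makes the effect visible. A minimal sketch reusing `a` and `b` from earlier cells (the result names are made up):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3], [4, 5, 6]])

print(a.T.shape)  # unchanged for a 1-D array
print(b.T.shape)  # rows and columns swapped

dot = a @ a       # dot product of a with itself: 1 + 4 + 9 + 16
gram = b @ b.T    # (2, 3) matrix times (3, 2) matrix gives a (2, 2) matrix

print(dot)
print(gram)
```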

# Pandas

## Pandas Series 

### Intro

In [29]:
g7_pop =  pd.Series([35.46, 63.95, 80.9, 60.6,128.0,64,310])

In [30]:
g7_pop # a Series has a data type and an index (and optionally a name); it is built on top of a NumPy array and is different from a list

0     35.46
1     63.95
2     80.90
3     60.60
4    128.00
5     64.00
6    310.00
dtype: float64

In [31]:
g7_pop.index = ['canada','france','germany', 'italy','japan','uk','us']

In [32]:
g7_pop # looks like a dictionary now? but it is ordered!

canada      35.46
france      63.95
germany     80.90
italy       60.60
japan      128.00
uk          64.00
us         310.00
dtype: float64

### Viewing/Selection/Boolean Arrays

In [33]:
g7_pop['us']

310.0

In [34]:
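g7_pop[1] # an integer key falls back to position here because the index holds strings; .iloc[1] (next cell) is the explicit form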

63.95

In [35]:
g7_pop.iloc[1] # although it has a labelled index, we can still refer to an element by integer location,
                # even though the index labels are actually country names!

63.95

In [36]:
g7_pop['canada':'italy'] # 'italy' is INCLUDED; this is different from slicing in regular Python or NumPy

canada     35.46
france     63.95
germany    80.90
italy      60.60
dtype: float64
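
To make the difference between label slicing and position slicing concrete, here is a small sketch on the same series (rebuilt here so the cell is self-contained; `by_label` and `by_position` are illustrative names):

```python
import pandas as pd

g7_pop = pd.Series(
    [35.46, 63.95, 80.9, 60.6, 128.0, 64, 310],
    index=['canada', 'france', 'germany', 'italy', 'japan', 'uk', 'us'],
)

by_label = g7_pop.loc['canada':'italy']  # label slice: the end label is included
by_position = g7_pop.iloc[0:3]           # position slice: the stop position is excluded

print(len(by_label), len(by_position))
```

Using `.loc[]` for labels and `.iloc[]` for positions makes the intent explicit, and newer Pandas releases discourage relying on plain `[]` with integers on a label-indexed series.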

In [37]:
g7_pop*1000000

canada      35460000.0
france      63950000.0
germany     80900000.0
italy       60600000.0
japan      128000000.0
uk          64000000.0
us         310000000.0
dtype: float64

In [38]:
g7_pop>70 # the vectorised operations return a Pandas series (instead of an array) but the concepts of boolean
            # arrays still applies here and is rather powerful

canada     False
france     False
germany     True
italy      False
japan       True
uk         False
us          True
dtype: bool

In [39]:
g7_pop[g7_pop>70] # a boolean Series can be used to select/filter just as in NumPy, although instead of an array we get a Series back

germany     80.9
japan      128.0
us         310.0
dtype: float64

In [40]:
g7_pop.mean()

106.13

### Modifying series 

In [41]:
g7_pop['us'] =350
g7_pop

canada      35.46
france      63.95
germany     80.90
italy       60.60
japan      128.00
uk          64.00
us         350.00
dtype: float64

In [42]:
g7_pop[0]=34.5
g7_pop

canada      34.50
france      63.95
germany     80.90
italy       60.60
japan      128.00
uk          64.00
us         350.00
dtype: float64

In [43]:
g7_pop.iloc[-1]=34.5
g7_pop

canada      34.50
france      63.95
germany     80.90
italy       60.60
japan      128.00
uk          64.00
us          34.50
dtype: float64

In [44]:
g7_pop[g7_pop>70] = g7_pop*1.2
g7_pop

canada      34.50
france      63.95
germany     97.08
italy       60.60
japan      153.60
uk          64.00
us          34.50
dtype: float64
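
The masked assignment above works because Pandas aligns the right-hand side to the selected labels. A slightly more explicit sketch of the same idea, on a shortened copy of the series (the values are taken from the cells above; `mask` is an illustrative name):

```python
import pandas as pd

s = pd.Series([34.5, 63.95, 80.9, 128.0],
              index=['canada', 'france', 'germany', 'japan'])

mask = s > 70                    # boolean Series: True for germany and japan
s.loc[mask] = s.loc[mask] * 1.2  # only the selected rows are scaled

print(s)
```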

In [45]:
g7_pop.iloc?

## Dataframes

### Dataframe Basics

In [46]:
pd.DataFrame?

In [47]:
# creating a df using a dictionary of name:column values, then optionally 
# defining the columns and indexes as well

df= pd.DataFrame({'Population':[35.46,63.95,80.90,60.6,128,64,350],
                'GDP':[
                    1785387,
                    2833687,
                    3874437,
                    2167744,
                    4602367,
                    2950039,
                    17348075
                ],
                'Surface Area':[
                     9984670,
                    640679,
                    357114,
                    301336,
                    377930,
                    242495,
                    9525067
                ],
                  'HDI': [
                    0.913,
                    0.888,
                    0.916,
                    0.873,
                    0.891,
                    0.907,
                    0.915
                ],
                'Continent': [
                    'America',
                    'Europe',
                    'Europe',
                    'Europe',
                    'Asia',
                    'Europe',
                    'America'
                ]
                },columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

In [48]:
df # dataframes are basically a combination of Pandas series, they have a column name and an index (row name) 
    # and are built on top of Numpy ndarrays 

   Population       GDP  Surface Area    HDI  Continent
0       35.46   1785387       9984670  0.913    America
1       63.95   2833687        640679  0.888     Europe
2        80.9   3874437        357114  0.916     Europe
3        60.6   2167744        301336  0.873     Europe
4       128.0   4602367        377930  0.891       Asia
5        64.0   2950039        242495  0.907     Europe
6       350.0  17348075       9525067  0.915    America


In [49]:
df.index=[
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [50]:
df

                Population       GDP  Surface Area    HDI  Continent
Canada               35.46   1785387       9984670  0.913    America
France               63.95   2833687        640679  0.888     Europe
Germany               80.9   3874437        357114  0.916     Europe
Italy                 60.6   2167744        301336  0.873     Europe
Japan                128.0   4602367        377930  0.891       Asia
United Kingdom        64.0   2950039        242495  0.907     Europe
United States        350.0  17348075       9525067  0.915    America


In [84]:
df= pd.DataFrame({'Population':[35.46,63.95,80.90,60.6,128,64,350],
                'GDP':[
                    1785387,
                    2833687,
                    3874437,
                    2167744,
                    4602367,
                    2950039,
                    17348075
                ],
                'Surface Area':[
                     9984670,
                    640679,
                    357114,
                    301336,
                    377930,
                    242495,
                    9525067
                ],
                  'HDI': [
                    0.913,
                    0.888,
                    0.916,
                    0.873,
                    0.891,
                    0.907,
                    0.915
                ],
                'Continent': [
                    'America',
                    'Europe',
                    'Europe',
                    'Europe',
                    'Asia',
                    'Europe',
                    'America'
                ]
                },columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Conti'],  # NOTE: 'Conti' does not match the dict key 'Continent', so this column comes out all NaN
                index=[
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'US',
                ])

In [85]:
df

                Population       GDP  Surface Area    HDI  Conti
Canada               35.46   1785387       9984670  0.913    NaN
France               63.95   2833687        640679  0.888    NaN
Germany               80.9   3874437        357114  0.916    NaN
Italy                 60.6   2167744        301336  0.873    NaN
Japan                128.0   4602367        377930  0.891    NaN
United Kingdom        64.0   2950039        242495  0.907    NaN
US                   350.0  17348075       9525067  0.915    NaN


In [53]:
df.info() # there are a number of interesting functions that we get from pandas that are helpful to 
            # quickly learn about the data

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to US
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Conti         0 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes


In [54]:
df.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'US'],
      dtype='object')

In [55]:
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Conti'], dtype='object')

In [56]:
df.size

35

In [57]:
df.shape

(7, 5)

In [58]:
df.describe().T # ONLY for numeric columns!

              count       mean         std        min        25%        50%        75%         max
Population      7.0   111.8443    108.7659      35.46     62.275       64.0     104.45       350.0
GDP             7.0  5080248.0   5494020.0  1785387.0  2500716.0  2950039.0  4238402.0  17348080.0
Surface Area    7.0  3061327.0   4576187.0   242495.0   329225.0   377930.0  5082873.0   9984670.0
HDI             7.0  0.9004286  0.01659174      0.873     0.8895      0.907      0.914       0.916


In [None]:
df.dtypes

### Indexing selecting and slicing 

In [66]:
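df.loc['Canada'] # .loc[] and .iloc[] select rows (horizontally), while plain [] selects columns (vertically).
                    # With a single label each of them returns a Pandas Series, and all of them use square brackets.
                    # .loc[] works on index labels, .iloc[] works on integer positions
# df.loc[0] throws an error (0 is not an index label)
# df.loc['Population'] throws an error ('Population' is a column, not a row label)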

Population        35.46
GDP             1785387
Surface Area    9984670
HDI               0.913
Conti               NaN
Name: Canada, dtype: object

In [67]:
df.loc['Canada']['Population']


35.46

In [68]:
df.iloc[1]


Population        63.95
GDP             2833687
Surface Area     640679
HDI               0.888
Conti               NaN
Name: France, dtype: object

In [69]:
df.iloc[1][0]


63.95
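
Chained lookups such as `df.loc['Canada']['Population']` work for reading, but the row and column can also be given in a single step. A minimal sketch on a two-row copy of the data (the result names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.46, 63.95], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

chained = df.loc['Canada']['Population']  # two lookups: row Series first, then one element
direct = df.loc['Canada', 'Population']   # one lookup with (row label, column label)
scalar = df.at['Canada', 'Population']    # .at[] is meant for single values by label
positional = df.iloc[1, 0]                # row position 1, column position 0

print(chained, direct, scalar, positional)
```

The single-step forms are generally preferable, especially when assigning values, because a chained lookup may operate on a temporary copy.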

In [70]:
df['GDP']
# df[0] throws error
# df['Canada'] throws error


Canada             1785387
France             2833687
Germany            3874437
Italy              2167744
Japan              4602367
United Kingdom     2950039
US                17348075
Name: GDP, dtype: int64

In [71]:
df['Population']


Canada             35.46
France             63.95
Germany            80.90
Italy              60.60
Japan             128.00
United Kingdom     64.00
US                350.00
Name: Population, dtype: float64

In [72]:
df.loc['France':'Italy'] # notice the .loc for slicing, else same as Series 

                Population       GDP  Surface Area    HDI  Conti
France               63.95   2833687        640679  0.888    NaN
Germany               80.9   3874437        357114  0.916    NaN
Italy                 60.6   2167744        301336  0.873    NaN


In [73]:
df.loc['France':'Italy','Population'] # first argument is the row selection, second is the column(s) we want!

France     63.95
Germany    80.90
Italy      60.60
Name: Population, dtype: float64

In [74]:
df.loc['France':'Italy']['Population'] 


France     63.95
Germany    80.90
Italy      60.60
Name: Population, dtype: float64

In [75]:
df.loc['France':'Italy',['Population','GDP']] # if we want more than one column, we need to pass them as a list
                                            # note that we are using .loc[] for this, not just []

         Population      GDP
France        63.95  2833687
Germany        80.9  3874437
Italy          60.6  2167744


In [76]:
df.iloc[1:3] # select rows by position using iloc

                Population       GDP  Surface Area    HDI  Conti
France               63.95   2833687        640679  0.888    NaN
Germany               80.9   3874437        357114  0.916    NaN


In [77]:

df.iloc[1:3,2:4] # slicing by rows and columns works with .iloc[] as well as .loc[]
                # note that .iloc[] excludes the stop position (1:3 gives positions 1 and 2), just like regular Python;
                # it is label-based .loc[] slicing that includes the end label

         Surface Area    HDI
France         640679  0.888
Germany        357114  0.916


### Conditional selection

In [59]:
df['Population']  >70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
US                 True
Name: Population, dtype: bool

In [60]:
df.loc[df['Population']>70] # we can filter rows with a vectorised logical operation passed to .loc[]
                            # plain df[...] accepts the same boolean Series, but .iloc[] does not and throws an error


                Population       GDP  Surface Area    HDI  Conti
Germany               80.9   3874437        357114  0.916    NaN
Japan                128.0   4602367        377930  0.891    NaN
US                   350.0  17348075       9525067  0.915    NaN


In [61]:
df.loc[df['Population']>70,'Population'] # we can add a second dimension as well, to keep only the columns we need


Germany     80.9
Japan      128.0
US         350.0
Name: Population, dtype: float64

In [62]:
df.loc[df['Population']>70,['Population','GDP']]


         Population       GDP
Germany        80.9   3874437
Japan         128.0   4602367
US            350.0  17348075
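
Several conditions can be combined into one mask, in the same way as with NumPy boolean arrays. This is a sketch on a two-column copy of the data above; the 0.9 HDI threshold and the result names are arbitrary choices for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.46, 63.95, 80.9, 60.6, 128.0, 64.0, 350.0],
     'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915]},
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'US'],
)

# Each condition needs its own parentheses; use & / | instead of and / or
mask = (df['Population'] > 70) & (df['HDI'] > 0.9)

via_loc = df.loc[mask]               # boolean Series passed to .loc[]
via_brackets = df[mask]              # plain [] accepts the same mask
via_iloc = df.iloc[mask.to_numpy()]  # .iloc[] needs a plain boolean array, not a Series
only_pop = df.loc[mask, 'Population']

print(via_loc)
```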


### Dropping Stuff

Dropping is sort of the inverse of selection, but nothing is really dropped: drop() returns a new dataframe and leaves the original unchanged!

In [63]:
df.drop('Germany')

                Population       GDP  Surface Area    HDI  Conti
Canada               35.46   1785387       9984670  0.913    NaN
France               63.95   2833687        640679  0.888    NaN
Italy                 60.6   2167744        301336  0.873    NaN
Japan                128.0   4602367        377930  0.891    NaN
United Kingdom        64.0   2950039        242495  0.907    NaN
US                   350.0  17348075       9525067  0.915    NaN


In [64]:
df.drop(['Germany','Japan']) # use a list for more than one row (or column)

# we can drop rows by passing their labels, or columns with the columns= argument (see the next cell)

                Population       GDP  Surface Area    HDI  Conti
Canada               35.46   1785387       9984670  0.913    NaN
France               63.95   2833687        640679  0.888    NaN
Italy                 60.6   2167744        301336  0.873    NaN
United Kingdom        64.0   2950039        242495  0.907    NaN
US                   350.0  17348075       9525067  0.915    NaN


In [65]:
df.drop(columns = ['Population','GDP'])# dropping columns 

                Surface Area    HDI  Conti
Canada               9984670  0.913    NaN
France                640679  0.888    NaN
Germany               357114  0.916    NaN
Italy                 301336  0.873    NaN
Japan                 377930  0.891    NaN
United Kingdom        242495  0.907    NaN
US                   9525067  0.915    NaN


In [90]:
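df # the drop() calls above did not change df (the GDP/Capita column comes from cell In [89] below, which was run before this one)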

                Population       GDP  Surface Area    HDI  Conti    GDP/Capita
Canada               35.46   1785387       9984670  0.913    NaN  50349.323181
France               63.95   2833687        640679  0.888    NaN  44310.977326
Germany               80.9   3874437        357114  0.916    NaN  47891.681088
Italy                 60.6   2167744        301336  0.873    NaN  35771.353135
Japan                128.0   4602367        377930  0.891    NaN  35955.992188
United Kingdom        64.0   2950039        242495  0.907    NaN  46094.359375
US                   350.0  17348075       9525067  0.915    NaN  49565.928571


#### Almost all Pandas operations leave the underlying dataframe unchanged and return a new object instead
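
A short sketch of what this means in practice, using `drop()` on a two-row toy dataframe (the names `original` and `without_germany` are made up for illustration). To keep a change, assign the result back to a variable:

```python
import pandas as pd

original = pd.DataFrame(
    {'Population': [35.46, 80.9], 'GDP': [1785387, 3874437]},
    index=['Canada', 'Germany'],
)

without_germany = original.drop('Germany')  # returns a new dataframe
print('Germany' in original.index)          # the original still has the row

# Re-assigning is what actually "keeps" the result
original = original.drop(columns=['GDP'])
print(list(original.columns))
```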

### Modifying dataframes

### Creating new columns from existing columns

In [89]:
df['GDP/Capita'] = df['GDP']/df['Population']
df

                Population       GDP  Surface Area    HDI  Conti    GDP/Capita
Canada               35.46   1785387       9984670  0.913    NaN  50349.323181
France               63.95   2833687        640679  0.888    NaN  44310.977326
Germany               80.9   3874437        357114  0.916    NaN  47891.681088
Italy                 60.6   2167744        301336  0.873    NaN  35771.353135
Japan                128.0   4602367        377930  0.891    NaN  35955.992188
United Kingdom        64.0   2950039        242495  0.907    NaN  46094.359375
US                   350.0  17348075       9525067  0.915    NaN  49565.928571
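
Assigning to `df['new column']` changes the dataframe itself. As an alternative sketch, `DataFrame.assign()` builds the same kind of derived column but returns a new dataframe; the column name `gdp_per_capita` and the two-row data are only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.46, 63.95], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

# assign() returns a copy with the extra column; df itself is not modified
with_ratio = df.assign(gdp_per_capita=df['GDP'] / df['Population'])

print(with_ratio)
print(list(df.columns))
```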


 <font color ='red'>**Up to this point, I have followed this video https://www.youtube.com/watch?v=r-uOLxNrNk8&ab_channel=freeCodeCamp.org from 1h:05 to 2h:37, and I will get back to the rest of the tutorial later. This portion covered NumPy basics and Pandas basics.**</font>

# From this point on I am using this tutorial
https://towardsdatascience.com/pandas-full-tutorial-on-a-single-dataset-4aa43461e1e2 

## <font color='red'>It has some really important concepts, so definitely don't skip this one</font>

### Some important concepts

1) Understanding “Axis”
A DataFrame object has two axes: “axis 0” and “axis 1”:

axis 0: wherever you see this, it refers to the rows (the index)
axis 1: wherever you see this, it refers to the columns
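
A small sketch of how the axis argument changes the result, on a toy dataframe invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_totals = df.sum(axis=0)  # aggregates down the rows: one value per column
row_totals = df.sum(axis=1)  # aggregates across the columns: one value per row

no_first_row = df.drop(0, axis=0)  # axis=0: the label 0 is looked up in the index
no_b = df.drop('b', axis=1)        # axis=1: the label 'b' is looked up in the columns

print(col_totals)
print(row_totals)
```

For aggregations, the axis is the direction that gets collapsed, which is why `axis=0` yields one result per column.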


2) Understanding “Inplace”
Understanding the “inplace” parameter can save us a lot of time and memory!

When inplace=False (the default), the operation is performed on a copy, which is returned. You then need to assign the result to a variable.

temp = df.set_index('CustomerId')  # here, by default, inplace=False
temp

When inplace=True, the data is modified in place: the call returns nothing (None) and the dataframe itself is updated.

df.set_index('CustomerId', inplace=True)
df
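
The two behaviours can be checked on a toy dataframe. The `CustomerId` column name follows the tutorial's example; the `Balance` column and all values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'CustomerId': [101, 102, 103],
                   'Balance': [0.0, 250.5, 99.9]})

# Default (inplace=False): a new dataframe is returned, df is untouched
temp = df.set_index('CustomerId')
print('CustomerId' in df.columns)  # still a regular column in df

# inplace=True: df itself is changed and the call returns None
result = df.set_index('CustomerId', inplace=True)
print(result)
print(df.index.name)
```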

# From this point on I am using this tutorial
https://www.datacamp.com/tutorial/pandas-tutorial-dataframe-python