# Introduction to Pandas

### What is Pandas?

Pandas is an open-source, high-level data manipulation library built on NumPy. It lets you analyse,
clean and manipulate data held in objects called DataFrames.

A DataFrame stores data in a table-like structure of rows (observations) and columns (features or labels).

### Series

A Series is an indexed, one-dimensional object, comparable to a DataFrame with a single column. The default index starts at 0 and runs up to the length of the Series minus one.

In [None]:
import numpy as np
import pandas as pd

In [None]:
#create a series with default indexing
s1 = pd.Series([0,1,3,4,5])
s1

In [None]:
s1 = pd.Series([0,'Hello','World',4,5.2])
s1

In [None]:
#create a series with specific indices
s1 = pd.Series([3, 4, 5, 6 ,7], index=['1st', '2nd', '3rd', '4th', '5th'])
s1

In [None]:
s1['3rd']

In [None]:
s1[s1 > 4]

In [None]:
s1[s1%2==0]

In [None]:
s1[1:-1]

In [None]:
s2 = pd.Series([0,'Hello','World',4,5.2])
# Series.append was removed in pandas 2.0; pd.concat joins the two Series instead
pd.concat([s2, s1])
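The description of a Series given at the start of this section can be checked directly. The following is a minimal sketch; the names `s`, `frame`, and `ages` are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# A Series built without an explicit index gets the default labels 0..n-1.
s = pd.Series([10, 20, 30])
print(list(s.index))

# Selecting a single column of a DataFrame returns a Series
# that keeps the DataFrame's row labels as its index.
frame = pd.DataFrame({'Age': [32, 31]}, index=['p1', 'p2'])
ages = frame['Age']
print(type(ages).__name__)
print(list(ages.index))
```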

### DataFrame

In [None]:
#empty dataframe
df = pd.DataFrame(columns=['First_name', 'Last_name', 'Age', 'Gender'])
df

In [None]:
#insert ordered values
df.loc['p1'] = ['Winston', 'Bishop', 32, 'Male']
df.loc['p2'] = ['Jessica', 'Day', 31, 'Female']
df

In [None]:
#insert unordered values
df.loc[3] = dict(Gender='Male', First_name='Nick', Age=33, Last_name='Miller')
df

In [None]:
#create a dataframe from ordered lists
l = [['Winston', 'Schmidt', 32, 'Male'], ['Cece', '', 31, 'Female']]
df = pd.DataFrame(l, index=[0, 'a'], columns=['First_name', 'Last_name', 'Age', 'Gender'])
df

In [None]:
#create a dataframe from a dict
d = {'a':['Jessica', 'Day', 31, 'Female'], 'b':['Nick', 'Miller', 33, 'Male']}

In [None]:
# row labels follow the order of the values in each list of d
df = pd.DataFrame(d, index=['First_name', 'Last_name', 'Age', 'Gender'])
df

In [None]:
df = pd.DataFrame.from_dict(d, orient='index')
# column names follow the order of the values in each list of d
df.columns = ['First_name', 'Last_name', 'Age', 'Gender']
df

In [None]:
df.T

#### Loading a CSV file to a Dataframe

In [None]:
# Alternative: mount Google Drive and read the file from there
#from google.colab import drive
#drive.mount('/content/gdrive')
#trees = pd.read_csv('gdrive/My Drive/trees.csv')
import io
from google.colab import files

uploaded = files.upload()
trees = pd.read_csv(io.StringIO(uploaded['trees.csv'].decode('utf-8')))
trees

In [None]:
trees.head()

In [None]:
trees.tail(7)

In [None]:
trees.drop('Index', axis=1, inplace=True)
trees.head()

In [None]:
trees.info()

In [None]:
trees.describe()

#### Column selection

In [None]:
trees['Height'].head()

In [None]:
type(trees['Height'])

In [None]:
trees[['Height', 'Volume']].head(3)

#### Row selection

In [None]:
trees[5:10]

In [None]:
trees.iloc[5:10]

In [None]:
trees[trees['Height'] > 80]

In [None]:
trees[(trees['Height'] > 80) & (trees['Girth'] < 12)]
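The row filters above rely on boolean masks: each comparison yields a Series of `True`/`False` values, and `&` combines two masks element by element. The sketch below uses a small made-up frame (`demo`, with invented numbers) rather than `trees.csv`, so it runs on its own. The parentheses around each comparison are needed because `&` binds more tightly than `>` and `<` in Python.

```python
import pandas as pd

# Hypothetical data, not taken from trees.csv.
demo = pd.DataFrame({'Height': [70, 85, 90], 'Girth': [8.3, 11.0, 14.2]})

# Each comparison produces a boolean Series; & combines them row by row.
mask = (demo['Height'] > 80) & (demo['Girth'] < 12)
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
selected = demo[mask]
print(selected.index.tolist())
```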

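The difference between `iloc[]` and `loc[]`: `iloc` selects rows by integer position, while `loc` selects rows by index label and includes the end label of a slice.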

In [None]:
trees2 = trees[trees['Height'] > 80]
trees2

In [None]:
trees2.iloc[:5]

In [None]:
trees2.loc[:5]

In [None]:
sorted_trees = trees.sort_values('Height')
sorted_trees.iloc[:7]

In [None]:
sorted_trees.loc[:7]
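The same contrast between positions and labels can be reproduced without `trees.csv`. This is a minimal sketch on a made-up frame (`demo` and its values are assumptions for illustration): after filtering or sorting, the row labels no longer match the row positions, so `iloc` and `loc` return different rows.

```python
import pandas as pd

# Hypothetical data with the default labels 0..4.
demo = pd.DataFrame({'Height': [70, 85, 63, 90, 81]})

# Filtering keeps the original labels of the surviving rows: 1, 3, 4.
tall = demo[demo['Height'] > 80]
print(tall.iloc[:3].index.tolist())   # first three rows by position
print(tall.loc[:3].index.tolist())    # every row up to and including label 3

# Sorting reorders the rows but each row keeps its label.
ordered = demo.sort_values('Height')
print(ordered.iloc[:4].index.tolist())  # first four rows by position
print(ordered.loc[:4].index.tolist())   # rows from the top down to label 4
```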

In [None]:
trees2.to_numpy()

In [None]:
trees2['Girth'].to_numpy()

# ===================================================

### Exercises

#### Q1

Read the student grades dataset `grades.csv` into a DataFrame, then display the first 5 observations to get an idea of what you are dealing with.

In [19]:
# To read the file from Google Drive in Google Colab, use the following code, assuming the file
# is in the main directory of your Drive (you will be asked to grant access to Google Drive).
##from google.colab import drive
##drive.mount('/content/gdrive')
##grades = pd.read_csv('gdrive/My Drive/grades.csv')

# This code uploads the file from your local machine.
import io
from google.colab import files

uploaded = files.upload()
grades = pd.read_csv(io.StringIO(uploaded['grades.csv'].decode('utf-8')))
grades.head()

# Alternatively, in a Jupyter notebook (instead of Google Colab), use the line below,
# provided grades.csv is in the same directory as the notebook file.
#grades = pd.read_csv('./grades.csv')

Saving grades.csv to grades.csv


Unnamed: 0,Lastname,Firstname,SSN,Test1,Test2,Test3,Test4,Final,Grade
0,AlfalfaNULL,Aloysius,123-45-6789,40.0,90.0,100.0,83.0,49.0,D-
1,Alfred,University,123-12-1234,41.0,97.0,96.0,97.0,48.0,D+
2,Gerty,Gramma,567-89-0123,41.0,80.0,60.0,40.0,44.0,C
3,Android,Electric,087-65-4321,42.0,23.0,36.0,45.0,47.0,B-
4,Bumpkin,Fred,456-78-9012,43.0,78.0,88.0,77.0,45.0,A-


#### Q2

Display the last 3 observations from the dataset

In [20]:
grades.tail(3)

Unnamed: 0,Lastname,Firstname,SSN,Test1,Test2,Test3,Test4,Final,Grade
13,Franklin,Benny,234-56-2890,50.0,1.0,90.0,80.0,90.0,B-
14,George,Boy,345-67-3901,40.0,1.0,11.0,-1.0,4.0,B
15,Heffalump,Harvey,632-79-9439,30.0,1.0,20.0,30.0,40.0,C


#### Q3

Display some information about the data (data types and non-null counts).

In [21]:
grades.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Lastname   16 non-null     object 
 1   Firstname  16 non-null     object 
 2   SSN        16 non-null     object 
 3   Test1      16 non-null     object 
 4   Test2      16 non-null     float64
 5   Test3      16 non-null     float64
 6   Test4      16 non-null     float64
 7   Final      16 non-null     object 
 8   Grade      15 non-null     object 
dtypes: float64(3), object(6)
memory usage: 1.2+ KB


#### Q4

Display a descriptive statistical summary of the data.

In [22]:
grades.describe()

Unnamed: 0,Test2,Test3,Test4
count,16.0,16.0,16.0
mean,36.625,61.75,59.25
std,41.463036,35.768701,32.08842
min,1.0,-1.0,-1.0
25%,1.0,28.25,39.0
50%,15.5,79.0,68.5
75%,82.5,91.5,84.25
max,97.0,100.0,97.0


#### Q5

Remove the column `SSN` (the social security number), as we don't need it. Make sure the original DataFrame is updated as well.

In [23]:
grades.drop('SSN', axis=1, inplace=True)
grades.head()

Unnamed: 0,Lastname,Firstname,Test1,Test2,Test3,Test4,Final,Grade
0,AlfalfaNULL,Aloysius,40.0,90.0,100.0,83.0,49.0,D-
1,Alfred,University,41.0,97.0,96.0,97.0,48.0,D+
2,Gerty,Gramma,41.0,80.0,60.0,40.0,44.0,C
3,Android,Electric,42.0,23.0,36.0,45.0,47.0,B-
4,Bumpkin,Fred,43.0,78.0,88.0,77.0,45.0,A-


#### Q6

Sort the data based on the `Final` grade and show the top 3 students in the class.

In [33]:
grades.sort_values('Final').tail(3)

Unnamed: 0,Lastname,Firstname,Test1,Test2,Test3,Test4,Final,Grade
13,Franklin,Benny,50.0,1.0,90.0,80.0,90.0,B-
9,Backus,Jim,48.0,1.0,97.0,96.0,97.0,A+
8,Airpump,Andrew,49.01.0,90.0,100.0,83.0,A,
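The `info()` output in Q3 lists `Final` with dtype `object`, which suggests its values are stored as text, so `sort_values('Final')` may order them alphabetically rather than numerically. The sketch below illustrates the effect on a small made-up frame (`scores` and its values are hypothetical, not read from `grades.csv`) and shows one possible remedy: converting the column with `pd.to_numeric(..., errors='coerce')`, which turns non-numeric entries into `NaN` before sorting.

```python
import pandas as pd

# Hypothetical data: a numeric column stored as text, with one stray letter.
scores = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Final': ['49.0', '90.0', '100.0', 'A'],
})

# Sorting text compares character by character, so '100.0' sorts before '49.0'
# and the letter 'A' sorts after every digit string.
text_order = scores.sort_values('Final')['Final'].tolist()
print(text_order)

# Convert to numbers; entries that cannot be parsed become NaN.
scores['Final_num'] = pd.to_numeric(scores['Final'], errors='coerce')

# Sort from highest to lowest; NaN rows are placed last by default.
top3 = scores.sort_values('Final_num', ascending=False).head(3)
print(top3['Name'].tolist())
```

Keeping the converted values in a separate column (`Final_num` here) leaves the original text untouched, which makes it easier to inspect the rows that failed to parse.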
