# Introduction to Pandas

### What is Pandas?

Built on NumPy, Pandas is an open-source, high-level data manipulation library that lets you analyse,
clean and manipulate data held in objects called DataFrames.
<br><br>A DataFrame is an object that stores data in a table-like structure of rows (observations) and columns (features or labels).
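
As a rough illustration of the rows-and-columns idea, the sketch below builds a tiny DataFrame from made-up data. The names `people`, `name` and `age` are placeholders chosen for this example only.

```python
import pandas as pd

# Hypothetical sample data: each dict key becomes a column (feature),
# each list position becomes a row (observation).
people = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy'],
    'age': [31, 32, 33],
})

n_rows, n_cols = people.shape          # (number of rows, number of columns)
column_names = list(people.columns)    # column labels in insertion order

print(people)
print(n_rows, n_cols, column_names)
```

Here `shape` reports three observations and two features, which matches the three list entries and two dict keys.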

### Series

A Series is an indexed, one-dimensional object (like a DataFrame with only a single column). The default index starts at 0 and goes up to the length of the Series minus one.
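
A minimal sketch of both points, assuming made-up values and a placeholder column name `score`: a Series created without an index gets integer labels starting at 0, and selecting a single column from a DataFrame gives back a Series.

```python
import pandas as pd

s = pd.Series([10, 20, 30])            # no index given, so default labels are used
default_labels = list(s.index)

frame = pd.DataFrame({'score': [10, 20, 30]})
column = frame['score']                # selecting one column returns a Series

print(default_labels)
print(type(column).__name__)
```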

In [None]:
import numpy as np
import pandas as pd

In [None]:
#create a series with default indexing
s1 = pd.Series([0,1,3,4,5])
s1

In [None]:
s1 = pd.Series([0,'Hello','World',4,5.2])
s1

In [None]:
#create a series with specific indices
s1 = pd.Series([3, 4, 5, 6 ,7], index=['1st', '2nd', '3rd', '4th', '5th'])
s1

In [None]:
s1['3rd']

In [None]:
s1[s1 > 4]

In [None]:
s1[s1%2==0]

In [None]:
s1[1:-1]

In [None]:
s2 = pd.Series([0,'Hello','World',4,5.2])
#Series.append() was removed in pandas 2.0; pd.concat() joins the two series instead
pd.concat([s2, s1])

### DataFrame

In [None]:
#empty dataframe
df = pd.DataFrame(columns=['First_name', 'Last_name', 'Age', 'Gender'])
df

In [None]:
#insert ordered values
df.loc['p1'] = ['Winston', 'Bishop', 32, 'Male']
df.loc['p2'] = ['Jessica', 'Day', 31, 'Female']
df

In [None]:
#insert unordered values
df.loc[3] = dict(Gender='Male', First_name='Nick', Age=33, Last_name='Miller')
df

In [None]:
#create a dataframe from ordered lists
l = [['Winston', 'Schmidt', 32, 'Male'], ['Cece', '', 31, 'Female']]
df = pd.DataFrame(l, index=[0, 'a'], columns=['First_name', 'Last_name', 'Age', 'Gender'])
df

In [None]:
#create a dataframe from a dict
d = {'a':['Jessica', 'Day', 31, 'Female'], 'b':['Nick', 'Miller', 33, 'Male']}

In [None]:
#row labels match the order of the values in each list
df = pd.DataFrame(d, index=['First_name', 'Last_name', 'Age', 'Gender'])
df

In [None]:
df = pd.DataFrame.from_dict(d, orient='index')
df.columns=['First_name', 'Last_name', 'Age', 'Gender']
df

In [None]:
df.T

#### Loading a CSV file into a DataFrame

In [None]:
#from google.colab import drive 
#drive.mount('/content/gdrive')
#trees =pd.read_csv('gdrive/My Drive/trees.csv')
import io
from google.colab import files

uploaded = files.upload()
trees = pd.read_csv(io.StringIO(uploaded['trees.csv'].decode('utf-8')))
trees

In [None]:
trees.head()

In [None]:
trees.tail(7)

In [None]:
trees.drop('Index', axis=1, inplace=True)
trees.head()

In [None]:
trees.info()

In [None]:
trees.describe()

#### Column selection

In [None]:
trees['Height'].head()

In [None]:
type(trees['Height'])

In [None]:
trees[['Height', 'Volume']].head(3)

#### Row selection

In [None]:
trees[5:10]

In [None]:
trees.iloc[5:10]

In [None]:
trees[trees['Height'] > 80]

In [None]:
trees[(trees['Height'] > 80) & (trees['Girth'] < 12)]

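The difference between `iloc` and `loc`: `iloc` selects rows by integer position, while `loc` selects them by index label, and a `loc` slice includes its end label.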
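
Because the `trees` data may not be at hand, here is a small self-contained sketch of the same contrast. The frame `demo`, its `height` column and its shuffled integer labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data whose integer labels do not match the row positions.
demo = pd.DataFrame({'height': [70, 85, 90, 82]}, index=[3, 0, 2, 1])

by_position = demo.iloc[:2]   # first two rows by position, end excluded
by_label = demo.loc[:2]       # every row up to and including the label 2

print(list(by_position.index))
print(list(by_label.index))
```

The two slices look alike but return different rows: `iloc[:2]` stops before position 2, while `loc[:2]` keeps going until it has included the row labelled 2.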

In [None]:
trees2 = trees[trees['Height'] > 80]
trees2

In [None]:
trees2.iloc[:5]

In [None]:
trees2.loc[:5]

In [None]:
sorted_trees = trees.sort_values('Height')
sorted_trees.iloc[:7]

In [None]:
sorted_trees.loc[:7]

In [None]:
trees2.to_numpy()

In [None]:
trees2['Girth'].to_numpy()

# ===================================================

### Exercises:

#### Q1

Read the student grades dataset `grades.csv` into a DataFrame, then display the first 5 observations to get an idea of what you are dealing with.

#### Q2

Display the last 3 observations from the dataset.

#### Q3

Display some information about the data (data types and non-null counts).

#### Q4

Display a descriptive statistical summary about the data.

#### Q5

Remove the column `SSN` (the social security number), as we don't need it. Make sure that the original DataFrame is updated as well.

#### Q6

Sort the data based on the `final` grade and show the top 3 students in the class.