# Lab 04 - Pandas Library - Andrew Badzioch

### Pandas Introduction
1. Pandas is another powerful Python library for data manipulation.
2. It is built on top of NumPy.
3. Pandas can operate on both numeric and text data types.
4. The two main data structures in Pandas are: Series and DataFrames.
5. A Series is a 1D structure, whereas a DataFrame is a 2D structure (tabular: rows and columns).
6. The official source for Pandas documentation is: https://pandas.pydata.org/
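As a quick illustration of points 2 to 5, here is a minimal sketch that builds one Series and one DataFrame and checks their dimensions. The names `temps` and `table` and all values are made up for illustration and are not part of the lab.

```python
import numpy as np
import pandas as pd

# A Series is 1D: a single column of values plus an index
temps = pd.Series([21.5, 19.0, 23.2], name="temp_c")

# A DataFrame is 2D and can mix text and numeric columns
table = pd.DataFrame({"city": ["Paris", "Boston"], "temp_c": [21.5, 19.0]})

print(temps.ndim)   # number of dimensions of the Series
print(table.ndim)   # number of dimensions of the DataFrame

# The values of a numeric Series are backed by a NumPy array
print(type(temps.to_numpy()))

# Each column keeps its own dtype
print(table.dtypes)
```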


#### Pandas installing and importing

In [1]:
# install pandas 
!pip install pandas



In [2]:
# import pandas
import pandas as pd
import numpy as np
import math

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [3]:
# check version
pd.__version__

'2.2.2'

#### Pandas Data Structures
1. Series: A Series (1D) data structure is made up of two parts:
    - the index
    - the values (column)
2. DataFrame: A DataFrame (df) is a 2D (rows and columns) data structure.
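The two parts of a Series can be inspected separately. The sketch below uses hypothetical data (`prices`, `frame`, and the fruit labels are illustrative names only) and also shows that a DataFrame column is itself a Series that shares the frame's index.

```python
import pandas as pd

# Hypothetical example data: labels in the index, numbers in the values
prices = pd.Series([1.25, 0.5, 3.0],
                   index=["apple", "banana", "cherry"],
                   name="price")

print(list(prices.index))   # the index (labels)
print(prices.to_numpy())    # the values
print(prices["banana"])     # look up one value by its label

# A DataFrame is 2D; each of its columns is a Series on the same index
frame = pd.DataFrame({"price": prices, "qty": [10, 20, 5]})
print(type(frame["qty"]))
```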

##### Pandas Series

In [4]:
# 1. creating a pandas series from a Python dictionary
d = {'a':1, 'b':2, 'c':3, 'd':4, 5:'andy'}
d

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 5: 'andy'}

In [5]:
type(d)

dict

In [6]:
# convert d (dict) to a pandas series
s1 = pd.Series(d)
s1

a       1
b       2
c       3
d       4
5    andy
dtype: object

In [7]:
type(s1)

pandas.core.series.Series

##### Attributes

In [8]:
s1.ndim

1

In [9]:
s1.shape

(5,)

In [10]:
s1.dtypes

dtype('O')

In [11]:
s1.size

5

In [12]:
s1.index

Index(['a', 'b', 'c', 'd', 5], dtype='object')

In [13]:
s1.values

array([1, 2, 3, 4, 'andy'], dtype=object)

In [14]:
# 2. creating a pandas series from a Python list
s2 = pd.Series([67, 89, 100, 55, 95], name='grades')
s2

0     67
1     89
2    100
3     55
4     95
Name: grades, dtype: int64

In [15]:
# adding a custom index
s3 = pd.Series(['Paris', 'Houston', 'London', 'Boston', 'Toronto', 'Honolulu'], name='Cities', index=['P', 'H', 'L', 'B', 'T', 'H'])
s3

P       Paris
H     Houston
L      London
B      Boston
T     Toronto
H    Honolulu
Name: Cities, dtype: object

In [16]:
# indexing H in s3 (the label is duplicated, so both rows are returned)
s3['H']

H     Houston
H    Honolulu
Name: Cities, dtype: object

In [17]:
s3['L']

'London'

In [18]:
# slicing on s3 (by position)
s3[:3]

P      Paris
H    Houston
L     London
Name: Cities, dtype: object

In [19]:
# positional access; s3[-1] relies on a deprecated fallback when the index holds labels
s3.iloc[-1]

'Honolulu'

In [20]:
s3[2:5]

L     London
B     Boston
T    Toronto
Name: Cities, dtype: object
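The cells above mix label-based and position-based access. A minimal sketch of the explicit alternatives, `.loc` for labels and `.iloc` for positions, is shown here on a copy of the same city data (the variable name `cities` is illustrative):

```python
import pandas as pd

cities = pd.Series(
    ["Paris", "Houston", "London", "Boston", "Toronto", "Honolulu"],
    index=["P", "H", "L", "B", "T", "H"],
)

by_label = cities.loc["H"]   # duplicated label: returns a Series with both matches
single = cities.loc["L"]     # unique label: returns a single value
last = cities.iloc[-1]       # last element by position
middle = cities.iloc[2:5]    # positions 2, 3 and 4 (end is exclusive)

# The index contains 'H' twice, so it is not unique
print(cities.index.is_unique)
```

Using `.loc` and `.iloc` makes the intent explicit and avoids ambiguity when an index contains integers.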

### Pandas DataFrames (df)
1. A Pandas DataFrame (df) is a 2D (two-dimensional) labeled table with rows and columns.
2. DataFrames are the most commonly used data structures in Pandas.
3. A DataFrame has two axes: axis=0 (rows) and axis=1 (columns).
4. A DataFrame can be created from several inputs, such as a list of dictionaries, a dictionary of lists or Series, or a 2D NumPy array.
5. A single dictionary is enough to create a DataFrame; in this lab we combine two dictionaries so that each one becomes a row.
6. A DataFrame can be thought of as a dictionary-like collection of Series (one per column) that share the same index.
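To make the axis convention concrete, here is a small sketch built from a single dictionary of lists. The quiz scores and the names `scores`, `col_totals`, and `row_totals` are invented for illustration.

```python
import pandas as pd

# One dictionary is enough: each key becomes a column
scores = pd.DataFrame(
    {"quiz1": [8, 6, 9], "quiz2": [7, 10, 5]},
    index=["Sara", "Mike", "Lily"],
)

# axis=0 works down the rows: one result per column
col_totals = scores.sum(axis=0)

# axis=1 works across the columns: one result per row
row_totals = scores.sum(axis=1)

print(col_totals)
print(row_totals)
```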

In [21]:
# create a df based on 2 dictionaries
d

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 5: 'andy'}

In [22]:
d1 = {1:'swan', 2:'eagle', 3:'dove', 4:'parrot', 5:'turkey'}

In [23]:
len(d)

5

##### Converting Python Dictionaries to Pandas df

In [24]:
# creating a df based on d and d1
# 1. put the 2 dictionaries in a list (each dictionary becomes one row)
d2 = [d, d1]
d2

[{'a': 1, 'b': 2, 'c': 3, 'd': 4, 5: 'andy'},
 {1: 'swan', 2: 'eagle', 3: 'dove', 4: 'parrot', 5: 'turkey'}]

In [25]:
type(d2)

list

In [26]:
# 2. convert python list to a pandas df
df = pd.DataFrame(d2)
df

     a    b    c    d       5     1      2     3       4
0  1.0  2.0  3.0  4.0    andy   NaN    NaN   NaN     NaN
1  NaN  NaN  NaN  NaN  turkey  swan  eagle  dove  parrot


In [27]:
df.shape

(2, 9)

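**Observation:** The DataFrame has 2 rows (one per dictionary) and 9 columns (the union of the keys in d and d1). Only key 5 appears in both dictionaries, so every other column holds a missing value in one of the two rows.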

In [28]:
df.ndim

2

In [29]:
df.size

18

In [30]:
df.dtypes

a    float64
b    float64
c    float64
d    float64
5     object
1     object
2     object
3     object
4     object
dtype: object

In [31]:
# check the missing values in df
df.isnull()

       a      b      c      d      5      1      2      3      4
0  False  False  False  False  False   True   True   True   True
1   True   True   True   True  False  False  False  False  False


In [32]:
df.isnull().sum()

a    1
b    1
c    1
d    1
5    0
1    1
2    1
3    1
4    1
dtype: int64

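**Observations:** Every column except 5 has exactly one missing value, because 5 is the only key present in both dictionaries. Columns a to d are float64 because NaN is a float, so the integers were upcast.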
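The same missing-value checks can be sketched on a smaller frame. The rows below are hypothetical (`rows`, `frame`, and the keys are illustrative), chosen so that the two dictionaries share only one key, as in the lab.

```python
import pandas as pd

# Hypothetical rows whose keys only partly overlap
rows = [{"a": 1, "b": 2, "k": "x"}, {"k": "y", "z": 3}]
frame = pd.DataFrame(rows)

# Missing values per column (default axis=0)
missing_per_column = frame.isna().sum()

# Missing values per row
missing_per_row = frame.isna().sum(axis=1)

# Total number of missing cells in the frame
total_missing = int(frame.isna().sum().sum())

print(missing_per_column)
print(missing_per_row)
print(total_missing)
```

`isna()` is an alias of `isnull()`, so either name gives the same result.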

In [33]:
# create a df from a numpy array
ary = np.array([[19, 24000], [24, 45000], [59, 102000], [78, 120000]])

In [34]:
df1 = pd.DataFrame(ary)
df1

    0       1
0  19   24000
1  24   45000
2  59  102000
3  78  120000


In [35]:
type(df1)

pandas.core.frame.DataFrame

In [36]:
# convert the array to a df with column names and a custom row index
df1 = pd.DataFrame(ary, columns=['Age', 'Income'],
                    index=['Sara', 'Mike', 'Lily', 'Andy'])
df1

      Age  Income
Sara   19   24000
Mike   24   45000
Lily   59  102000
Andy   78  120000


In [37]:
df1.shape

(4, 2)
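Once the array has column names and a row index, individual columns, rows, and cells can be selected by label or by position. A minimal sketch using the same array (the variable names `people`, `ages`, `mike`, `lily_income`, and `first_two` are illustrative):

```python
import numpy as np
import pandas as pd

ary = np.array([[19, 24000], [24, 45000], [59, 102000], [78, 120000]])
people = pd.DataFrame(ary, columns=["Age", "Income"],
                      index=["Sara", "Mike", "Lily", "Andy"])

ages = people["Age"]                        # one column, returned as a Series
mike = people.loc["Mike"]                   # one row by label, returned as a Series
lily_income = people.loc["Lily", "Income"]  # a single cell by row and column label
first_two = people.iloc[:2]                 # the first two rows by position

print(ages)
print(mike)
print(lily_income)
print(first_two)
```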

### Conclusion:
- Add 10 takeaways from completing this lab.
1. A Series is a one-dimensional labeled array that can hold any data type (int, str, float).
2. Each element in a Series has an associated label or index, which can be numeric or custom.
3. A Series has a single dtype; when the values are mixed (as in s1), they are stored with dtype object.
4.
5.
6. A DataFrame is a two-dimensional data structure, essentially a table with rows and columns.
7. Columns in a DataFrame can be of different data types
8. Attributes give useful information about the data and don't require parentheses () when accessed (e.g. .ndim, .shape, .size).
9. Methods perform operations on Series and DataFrame objects and require parentheses () when called (e.g. .info(), .isnull()).
10. 
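The difference between attributes and methods in takeaways 8 and 9 can be sketched with the grades from earlier in the lab (the names `n_missing` and `mean_grade` are illustrative):

```python
import pandas as pd

grades = pd.Series([67, 89, 100, 55, 95], name="grades")

# Attributes: accessed without parentheses
print(grades.shape)
print(grades.size)

# Methods: called with parentheses
n_missing = grades.isnull().sum()
mean_grade = grades.mean()
print(n_missing, mean_grade)

# Leaving the parentheses off a method returns the method object itself,
# not its result
print(callable(grades.isnull))
```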

#### End of Lab 04