# Python Cheatsheets - DATA ANALYSIS

## Data Manipulation

### Pandas
Pandas is a fast, flexible, open source tool for data analysis and manipulation,
built on top of the Python programming language. It provides high-performance, easy-to-use data structures for working with data in Python.

More information: https://pandas.pydata.org/

In [None]:
# RUN IF PANDAS IS NOT INSTALLED

# install pandas either with conda or with pip
# conda install -c conda-forge pandas
!pip install pandas

# verify that pandas is installed (uncomment to run)
# !pip show pandas

In [6]:
# import pandas and the other libraries used in this notebook
import pandas as pd
import numpy as np
import csv

Pandas provides two main classes for handling data:

- __Series__: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.

- __DataFrame__: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

In [8]:
# create a series by passing a list of values
s = pd.Series([1, 3, 4, np.nan, 6, 9])
print(s)

0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    9.0
dtype: float64
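
The output above shows `float64` even though most of the values were written as integers. A minimal sketch of why, and of what "labeled" means for a Series, is given below. The labels `a`, `b`, `c` and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A Series with explicit (hypothetical) labels instead of the default 0..n-1 index
labeled = pd.Series([1, 3, 4], index=["a", "b", "c"])
print(labeled.dtype)      # an integer dtype, since no value is missing

# np.nan is a float, so including it upcasts the whole Series to float64
with_nan = pd.Series([1, 3, 4, np.nan])
print(with_nan.dtype)

# Values can be looked up by label rather than by position
print(labeled["b"])

# Count the missing values
print(with_nan.isna().sum())
```
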


A DataFrame can be created by passing a dictionary of objects where the keys are the column labels and the values are the column values.

In [12]:
# create a DataFrame with a dictionary
names = ['Germany', 'Australia', 'Japan', 'India', 'China', 'United Kingdom']
cap =  ['Berlin', 'Canberra', 'Tokyo', 'New Delhi', 'Beijing', 'London']
pop = [84552242, 26713205, 123753041, 1450935791, 1419321278, 69138192]
codes = ['DE', 'AU', 'JP', 'IN', 'CN', 'GB']

# create a dictionary my_dict with three key:value pairs (column label: column values)
my_dict = {
    'Country':names,
    'Capital':cap,
    'Population':pop
}

# build the DataFrame countries from my_dict and use the country codes as row labels
countries = pd.DataFrame(my_dict)
countries.index = codes

# print DataFrame
print(countries)

           Country    Capital  Population
DE         Germany     Berlin    84552242
AU       Australia   Canberra    26713205
JP           Japan      Tokyo   123753041
IN           India  New Delhi  1450935791
CN           China    Beijing  1419321278
GB  United Kingdom     London    69138192
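
As a possible shortcut, the row labels can also be passed to the constructor through its `index` parameter instead of being assigned afterwards. The sketch below uses a three-row subset of the data above. The name `countries_small` is hypothetical, and the `.loc` lookup is included only to show what the row labels are useful for.

```python
import pandas as pd

# Subset of the data from the cell above
data = {
    "Country": ["Germany", "Australia", "Japan"],
    "Capital": ["Berlin", "Canberra", "Tokyo"],
    "Population": [84552242, 26713205, 123753041],
}
codes = ["DE", "AU", "JP"]

# Set the row labels at construction time via the index parameter
countries_small = pd.DataFrame(data, index=codes)

# Look up a single cell by row label and column label
print(countries_small.loc["JP", "Capital"])

# A column is itself a Series, so Series methods apply to it
print(countries_small["Population"].sum())
```
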


In [13]:
# save the data as a CSV file for the later exercises
import os
os.makedirs("data", exist_ok=True)  # to_csv does not create missing directories
countries.to_csv("data/countries.csv")

Building a DataFrame from an in-memory dictionary does not scale well to datasets with millions of observations. More commonly, a DataFrame is created by reading data from a source file in which the data has a regular structure. A typical example is the CSV file, short for "comma-separated values".

In [14]:
# read the CSV file, using the first column (the country codes) as the row index
df = pd.read_csv('data/countries.csv', index_col=0)

# print out the tabular data
print(df)

           Country    Capital  Population
DE         Germany     Berlin    84552242
AU       Australia   Canberra    26713205
JP           Japan      Tokyo   123753041
IN           India  New Delhi  1450935791
CN           China    Beijing  1419321278
GB  United Kingdom     London    69138192
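
To illustrate what `index_col=0` does in the cell above, here is a small sketch that writes a two-row DataFrame to an in-memory buffer and reads it back with and without that argument. Using `io.StringIO` in place of a file on disk is an assumption made so the example runs without a `data/` directory. The names `df_small`, `with_index`, and `without_index` are hypothetical.

```python
import io

import pandas as pd

df_small = pd.DataFrame(
    {"Country": ["Germany", "Japan"], "Capital": ["Berlin", "Tokyo"]},
    index=["DE", "JP"],
)

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df_small.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# With index_col=0 the first CSV column becomes the row index again
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index.index.tolist())

# Without it, the codes are read as an ordinary extra column
without_index = pd.read_csv(io.StringIO(csv_text))
print(without_index.columns.tolist())
```

If the index is not needed in the file at all, `to_csv` also accepts `index=False` to leave it out.
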
