# Pandas

**Pandas** is a Python library for working with datasets. Its central object is the **DataFrame**, a two-dimensional form of **structured data** that resembles a table with rows and columns.

# Importing Data

## CSV Files

CSV stands for Comma Separated Values. It is a plain-text file format that stores a table with one row per line and commas between the values. Data presented in a table is called **structured data**, because it adheres to the idea that there is a meaningful relationship between the columns and rows.
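As a minimal sketch of how such a file becomes a DataFrame, the example below uses `pd.read_csv`. The CSV content is made up for illustration and is read from an in-memory buffer (`io.StringIO`) so the snippet runs without a file on disk; with a real file, you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example runs without a file on disk.
csv_text = """item,color,size
shirt,red,M
sweater,yellow,L
jacket,black,L
"""

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

The first line of the CSV is used as the column names, and pandas adds a default integer index.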


## Dictionaries

We can create a DataFrame from a Python dictionary using the `from_dict` method:

In [4]:
import pandas as pd

data = {"col_0": [4,5,6,7], "col_1": ["e", "f", "g", "h"]}
pd.DataFrame.from_dict(data)

,col_0,col_1
0,4,e
1,5,f
2,6,g
3,7,h


By default, the DataFrame is created with the dictionary keys as columns. Note that the value lists must all have the same length for this to work. We can also use the keys as the index instead of the columns:

In [5]:
pd.DataFrame.from_dict(data, orient="index")

,0,1,2,3
col_0,4,5,6,7
col_1,e,f,g,h
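To illustrate the note about equal lengths, here is a small sketch with a made-up dictionary whose lists differ in length. In the pandas versions I am aware of, the default orientation raises a `ValueError`, while `orient="index"` pads the shorter row with missing values.

```python
import pandas as pd

# Hypothetical data in which the two lists have different lengths.
uneven = {"col_0": [4, 5, 6], "col_1": ["e", "f"]}

# Keys as columns: pandas cannot align the lists, so it raises ValueError.
try:
    pd.DataFrame.from_dict(uneven)
    raised = False
except ValueError:
    raised = True
print(raised)

# Keys as index: each list becomes a row, and the short row is padded
# with a missing value.
padded = pd.DataFrame.from_dict(uneven, orient="index")
print(padded)
```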


With `orient="index"`, we can also specify column names:

In [6]:
pd.DataFrame.from_dict(data, orient="index", columns=["E", "F", "G", "H"])

,E,F,G,H
col_0,4,5,6,7
col_1,e,f,g,h
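The `columns` argument of `from_dict` is only accepted together with `orient="index"`. The sketch below shows what happens with the default orientation and one possible alternative, renaming after construction. The new names `A` and `B` are placeholders chosen for this example.

```python
import pandas as pd

data = {"col_0": [4, 5, 6, 7], "col_1": ["e", "f", "g", "h"]}

# Passing columns together with the default orient="columns" raises ValueError.
try:
    pd.DataFrame.from_dict(data, columns=["A", "B"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# One alternative: build the frame first, then rename the columns.
renamed = pd.DataFrame.from_dict(data).rename(columns={"col_0": "A", "col_1": "B"})
print(renamed.columns.tolist())
```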


# Practice

Try it yourself! Create a DataFrame from the dictionary `clothes`, using the keys as the index and `['color', 'size']` as the column names.

In [8]:
clothes = {"shirt": ["red", "M"], "sweater": ["yellow", "L"], "jacket": ["black", "L"]}

pd.DataFrame.from_dict(clothes, orient="index", columns=["color", "size"])

,color,size
shirt,red,M
sweater,yellow,L
jacket,black,L


# JSON Files

JSON is short for JavaScript Object Notation. It is another widely used format for storing and transferring data, and it is lightweight and human-readable. In Python, we can use the built-in `json` module to parse JSON. Here is an example of a JSON string.

In [9]:
info = """{
    "firstName" : "Cynthia",
    "lastName" : "Koskei",
    "hobby" : "swimming",
    "age" : 24
}"""
print(info)

{
    "firstName" : "Cynthia",
    "lastName" : "Koskei",
    "hobby" : "swimming",
    "age" : 24
}


In [10]:
# Load the JSON string into a Python dict
import json

data = json.loads(info)
data

{'firstName': 'Cynthia', 'lastName': 'Koskei', 'hobby': 'swimming', 'age': 24}

In [11]:
data['lastName']

'Koskei'

A dictionary may not be as convenient as a `DataFrame` for data manipulation and cleaning. Once we've turned our JSON string into a dictionary, though, we can transform it into a `DataFrame` using the `from_dict` method.

In [12]:
df = pd.DataFrame.from_dict(data, orient="index", columns=["User 1"])
df

,User 1
firstName,Cynthia
lastName,Koskei
hobby,swimming
age,24
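JSON data often arrives as an array of objects rather than a single object. As a hedged sketch of that case, the example below parses such an array and passes the resulting list of dictionaries to the `DataFrame` constructor, which produces one row per object. The second user is invented purely for illustration.

```python
import json

import pandas as pd

# A JSON array of objects; the second user is invented for illustration.
users_json = """[
    {"firstName": "Cynthia", "lastName": "Koskei", "hobby": "swimming", "age": 24},
    {"firstName": "Amina", "lastName": "Otieno", "hobby": "chess", "age": 31}
]"""

records = json.loads(users_json)  # a list of dicts
users = pd.DataFrame(records)     # one row per dict, one column per key
print(users)
```

This layout (one row per user, one column per field) is usually easier to filter and aggregate than the one-column-per-user layout.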


# Practice

Try it yourself! Load the JSON string `clothes`, transform it into a `DataFrame`, and name the columns appropriately.

In [16]:
clothes = """{"shirt": ["red","M"], "sweater": ["yellow","L"]}"""

data = json.loads(clothes)
df = pd.DataFrame.from_dict(data, orient="index", columns=["color", "size"])
df

,color,size
shirt,red,M
sweater,yellow,L


# Loading Compressed Files

Large datasets are often stored in compressed files, so we will likely need to read data from them. One way to decompress the data is Python's built-in `gzip` module. We can load the `file.json.gz` file from the data folder using the following code:

In [None]:
import gzip
import json

with gzip.open("data/file.json.gz", "r") as f:
    file_data_gz = json.load(f)

`file_data_gz` is a dictionary, and we only need the `data` portion of it.

In [None]:
file_data_gz.keys()

We can use the `from_dict` function from pandas to read the data:

In [None]:
df = pd.DataFrame.from_dict(file_data_gz["data"])

In [None]:
df.head()
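Since `data/file.json.gz` is not included here, the following self-contained sketch writes a small gzip-compressed JSON file to a temporary directory and reads it back. The payload, including the structure under its `data` key, is an assumption made for this example and does not reflect the contents of the real file.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical payload: a dict whose "data" key holds the table we want.
payload = {"source": "example", "data": {"col_0": [4, 5], "col_1": ["e", "f"]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.json.gz")

    # Write the payload as gzip-compressed JSON text.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back: gzip decompresses, json.load parses.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)

df_gz = pd.DataFrame.from_dict(loaded["data"])
print(df_gz)
```

Opening the file in text mode (`"rt"`) with an explicit encoding makes the decoding step visible; the binary mode used above also works because `json.load` accepts bytes.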

# Pickle Files

Python's `pickle` module is primarily used for serializing and deserializing a Python object structure. Serialization is the process of turning an object in memory into a stream of bytes so you can store it on disk or send it over a network. Deserialization is the reverse process: turning a stream of bytes back into an object in memory.
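The round trip can be sketched entirely in memory with `pickle.dumps` and `pickle.loads`, which work with bytes rather than files. The dictionary here is just sample data.

```python
import pickle

clothes = {"shirt": ["red", "M"], "sweater": ["yellow", "L"]}

# Serialization: object in memory -> bytes.
blob = pickle.dumps(clothes)
print(type(blob))

# Deserialization: bytes -> a new object equal to the original.
restored = pickle.loads(blob)
print(restored == clothes, restored is clothes)
```

The restored dictionary compares equal to the original but is a separate object.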

According to the `pickle` module documentation for Python 3, the following types can be pickled:

* `None`
* Booleans (`True` and `False`)
* Integers, floating-point numbers, complex numbers
* Strings, bytes, and bytearrays
* Tuples, lists, sets, and dictionaries containing only objects that can be pickled
* Functions defined at the top level of a module (using `def`, not `lambda`)
* Built-in functions defined at the top level of a module
* Classes that are defined at the top level of a module
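To make the "top level of a module" requirement concrete, here is a small sketch contrasting a built-in function with a `lambda`. Functions are pickled by reference to their name, so an anonymous function has nothing to refer to. The exact exception type can vary between Python versions, so the example catches both likely candidates.

```python
import pickle

# A built-in function is pickled by reference to its name, so it round-trips
# to the very same function object.
same_function = pickle.loads(pickle.dumps(len)) is len
print(same_function)

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print(lambda_pickled)
```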

In [1]:
clothes = {"shirt": ["red", "M"],
           "sweater": ["yellow", "L"],
           "jacket": ["black", "L"]}
clothes

{'shirt': ['red', 'M'], 'sweater': ['yellow', 'L'], 'jacket': ['black', 'L']}

In [None]:
import pickle

# Write in binary mode; the with block closes the file once the dump finishes.
with open("./data/clothes.pkl", "wb") as f:
    pickle.dump(clothes, f)
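Reading the object back uses `pickle.load` on a file opened in binary mode. The sketch below writes to a temporary directory instead of `./data` so that it runs anywhere; the file name is reused only for continuity.

```python
import os
import pickle
import tempfile

clothes = {"shirt": ["red", "M"],
           "sweater": ["yellow", "L"],
           "jacket": ["black", "L"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clothes.pkl")

    # Write the object to disk in binary mode.
    with open(path, "wb") as f:
        pickle.dump(clothes, f)

    # Read it back, again in binary mode.
    with open(path, "rb") as f:
        clothes_loaded = pickle.load(f)

print(clothes_loaded == clothes)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.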