# Reading Files

Any file can be accessed and read. A Python file object is created when a file is opened with the `open()` function. You can bind this file object to a variable by opening the file with the `with` and `as` keywords; the file is then closed automatically when the block ends.

In [None]:
with open('sample_data/README.md') as file_object:
  print(file_object)

The ``readlines()`` method is an easy way to access a file's content line by line.

In [None]:
with open('sample_data/anscombe.json') as file_object:
  lines = file_object.readlines()
  for line in lines:
    print(line)
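
Each line returned by `readlines()` keeps its trailing newline, so `print(line)` shows an extra blank line after every line. As a minimal sketch (the temporary file and its contents are made up for illustration and are not part of `sample_data`), you can iterate over the file object directly and strip the newline:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
  tmp.write("first line\nsecond line\nthird line\n")
  path = tmp.name

# A file object is iterable: each iteration yields one line, newline included.
with open(path) as file_object:
  stripped_lines = [line.rstrip("\n") for line in file_object]

for line in stripped_lines:
  print(line)

os.remove(path)  # clean up the temporary file
```

Iterating over the file object reads one line at a time, which is usually preferable for large files because the whole content is never held in memory at once.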

## JSON File
JSON stands for **J**ava**S**cript **O**bject **N**otation.

It is a lightweight format for storing and transporting data. It is self-describing, easy to understand and often used to send data from a server to a web page.

You can find further information at https://www.w3schools.com/js/js_json_intro.asp.

In Python, JSON data is read with the ``json`` module. The ``load`` function maps the content to Python objects, typically a *dictionary* or a *list* of dictionaries.

In [None]:
import json
  
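# Open the JSON file and read its data; the file is closed when the block ends
with open('sample_data/anscombe.json') as file:
  data = json.load(file)

# Iterate through the top-level list (one dictionary per record)
for i in data:
  print(i)

# Iterate through the keys and values of the first record
for i in data[0]:
  print(i, data[0][i])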

If the JSON data is held in a string variable, we use the ``loads`` function instead.

In [None]:
import json
  
json_string = "{\"who\": \"me\", \"when\": \"now\"}"
data = json.loads(json_string)
  
# Iterating through the json
for i in data.keys():
  print(i, ":", data[i])
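
The Python type you get back depends on the top-level JSON value. A small sketch with made-up strings (not taken from `sample_data`) that shows the mapping for both functions:

```python
import io
import json

# A JSON object at the top level becomes a dict.
as_dict = json.loads('{"who": "me", "when": "now"}')

# A JSON array at the top level becomes a list (here, a list of dicts).
as_list = json.loads('[{"x": 1, "y": 2}, {"x": 3, "y": 4}]')

# json.load does the same, but reads from a file-like object.
from_file = json.load(io.StringIO('{"count": 3, "ok": true, "missing": null}'))

print(type(as_dict).__name__, type(as_list).__name__)
print(from_file)  # JSON true/null map to Python True/None
```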

## CSV File
CSV stands for **C**omma **S**eparated **V**alues.

A CSV file is a text file that stores tabular data: one row per line, with the fields separated by commas.

In Python, CSV data is read with the ``csv`` module. The ``reader`` function maps the content of each line to a list.

In [None]:
import csv

with open('sample_data/california_housing_test.csv') as file:
  csv_reader = csv.reader(file, delimiter = ',')
  lines = 0
  for row in csv_reader:
    if lines == 0:
      print("Column names are:", ", ".join(row))
    else:
      print(row)
    lines += 1
    if lines >= 10:
      break
print(f'Processed {lines} lines.')

A handier way to read a CSV file is the ``DictReader`` class, which converts each data row into a *dictionary*.

In [None]:
with open('sample_data/california_housing_test.csv') as file:
  csv_reader = csv.DictReader(file, delimiter = ',')
  lines = 0
  for row in csv_reader:
    print(row["latitude"], row["longitude"])
    lines += 1
    if lines >= 10:
      break
print(f'Processed {lines} lines with data.')
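
Both `reader` and `DictReader` return every field as a string, so numeric columns need an explicit conversion. A minimal sketch using an in-memory CSV with invented values (the column names mirror the ones above, the numbers are illustrative only):

```python
import csv
import io

# In-memory stand-in for a CSV file (hypothetical data).
csv_text = "longitude,latitude\n-122.05,37.37\n-118.30,34.26\n"

with io.StringIO(csv_text) as file:
  rows = list(csv.DictReader(file))

# Every value is read as str; convert the ones we need to float.
coordinates = [(float(row["latitude"]), float(row["longitude"])) for row in rows]

print(rows[0])
print(coordinates)
```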

# Pandas
**Pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the **Python** programming language. Everything about pandas can be found at https://pandas.pydata.org/.

It supports row and column metadata, so any operation meant for tables can be accomplished with Pandas. Connectors exist for almost any data source, which makes it suitable for data cleansing, combining and filtering.


## Series and Dataframes

The basic Pandas objects to work with data are ``Series`` and ``DataFrame``.

You can think of a ``Series`` as the column of a table.

In [None]:
import pandas as pd

data = [1, 7, 2]
s = pd.Series(data)
s

As you can see, by default, values are labeled with their index number. But if we use a ``dictionary`` as the `Series` data, its keys become the labels.



In [None]:
import pandas as pd

steps = {"day1": 42045, "day2": 38045, "day3": 43390}
s = pd.Series(steps)
s

Data sets in Pandas are usually two-dimensional tables, called `DataFrames`. A DataFrame may be built by joining series.

More info at https://pandas.pydata.org/docs/reference/frame.html.

In [None]:
import pandas as pd

steps = {"day1": 42045, "day2": 38045, "day3": 43390}
s = pd.Series(steps)
calories_intake = {"day1": 2051, "day2": 1945, "day3": 3390}
c_in = pd.Series(calories_intake)
calories_gym = {"day1": 0, "day2": 395, "day3": 698}
c_gym = pd.Series(calories_gym)

df = pd.DataFrame()
df['steps'] = s
df['cals_in'] = c_in
df['cals_gym'] = c_gym
df
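
The same table can also be built in a single step by passing a dictionary of `Series` to the `DataFrame` constructor; pandas aligns the series on their labels. A sketch reusing the values above, with one label deliberately left out of `c_gym` to show the alignment:

```python
import pandas as pd

s = pd.Series({"day1": 42045, "day2": 38045, "day3": 43390})
c_in = pd.Series({"day1": 2051, "day2": 1945, "day3": 3390})
# "day1" is omitted on purpose (illustrative change to the data above).
c_gym = pd.Series({"day2": 395, "day3": 698})

# Keys become column names; rows are aligned on the series labels.
df = pd.DataFrame({"steps": s, "cals_in": c_in, "cals_gym": c_gym})
print(df)
```

The missing `day1` value in `cals_gym` shows up as `NaN` instead of shifting the other values.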

## Data from lists
Let's create a dataframe from a couple of lists in a dictionary.

In [None]:
import pandas as pd

data = {
  "longitude": [30.23423, 25.98798, 13.00974],
  "latitude": [15.09884, 12.02374, 16.98732]
}
df = pd.DataFrame(data)
df

## Data from CSV
Let's read a CSV file and manipulate its data with pandas.

In [None]:
import pandas as pd

df = pd.read_csv('sample_data/california_housing_test.csv')
df

You can check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html for the reference on sorting; the same documentation site covers everything else you can do with a dataframe.


## Basic Operations

Let's check what the dataframe contains.

In [None]:
print(df.shape)
print()
df.describe()

Let's **sort** the `dataframe`.

In [None]:
df_sorted = df.sort_values(['housing_median_age', 'latitude'])
df_sorted
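
`sort_values` sorts in ascending order by default. Its `ascending` parameter accepts one flag per column; a sketch on a small made-up dataframe (`df_demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data using two of the column names from the CSV above.
df_demo = pd.DataFrame({
  "housing_median_age": [20, 35, 20, 35],
  "latitude": [34.1, 37.8, 33.9, 36.5]
})

# Highest median age first; within the same age, lowest latitude first.
df_demo_sorted = df_demo.sort_values(
  ["housing_median_age", "latitude"], ascending=[False, True]
)
print(df_demo_sorted)
```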

Let's **rearrange** data and keep just a few columns.

In [None]:
df_reshaped = df_sorted.reindex(columns = ['housing_median_age', 'latitude'])
df_reshaped

Or even easier.

In [None]:
df_reshaped_again = df_sorted[['housing_median_age', 'latitude', 'longitude']]
df_reshaped_again

We can also combine two Dataframes in different ways.

We may **concatenate** two or more dataframes, like in an SQL `UNION ALL` (repeated rows are kept).

In [None]:
import pandas as pd

data1 = {
  "id": [1, 2, 3],
  "name": ["John", "Mike", "Troy"]
}
data2 = {
  "id": [10, 11, 3],
  "name": ["Steve", "Ray", "Troy"]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)

# compare the result with and without ignore_index
pd.concat([df1, df2], ignore_index = True)
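
`concat` keeps every row, including the repeated `3, Troy`, and without `ignore_index` it also keeps the original row labels. A sketch of both effects, plus `drop_duplicates()` to get the behaviour of a plain SQL `UNION`:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Mike", "Troy"]})
df2 = pd.DataFrame({"id": [10, 11, 3], "name": ["Steve", "Ray", "Troy"]})

# Original labels are kept, so 0, 1, 2 appear twice.
kept_index = pd.concat([df1, df2])

# A fresh 0..5 index is generated.
new_index = pd.concat([df1, df2], ignore_index=True)

# Dropping the repeated row leaves five distinct rows.
union = new_index.drop_duplicates()

print(kept_index.index.tolist())
print(new_index.index.tolist())
print(union)
```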

We can also concatenate dataframes whose column names differ; the columns missing from one dataframe are filled with `NaN`.

In [None]:
import pandas as pd

data1 = {
  "id": [1, 2, 3],
  "first name": ["John", "Mike", "Troy"]
}
data2 = {
  "id": [10, 11, 12],
  "last name": ["Rambo", "Tyson", "Aikman"]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)

# compare the result with and without ignore_index
pd.concat([df1, df2], axis = 0, ignore_index = True)

We can also **merge** dataframes like in an SQL join.

In [None]:
import pandas as pd

data1 = {
  "id": [1, 2, 3],
  "first name": ["John", "Mike", "Troy"]
}
data2 = {
  "id": [1, 2, 4],
  "last name": ["Rambo", "Tyson", "Aikman"]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# how can take 'inner', 'left', 'right' or 'outer'
pd.merge(df1, df2, on = 'id', how = 'outer')
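
To compare the four `how` options, the sketch below counts the rows each one returns for the same two dataframes. It also passes `indicator = True`, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "first name": ["John", "Mike", "Troy"]})
df2 = pd.DataFrame({"id": [1, 2, 4], "last name": ["Rambo", "Tyson", "Aikman"]})

# Number of rows produced by each join type.
row_counts = {
  how: len(pd.merge(df1, df2, on="id", how=how))
  for how in ["inner", "left", "right", "outer"]
}
print(row_counts)

# _merge is 'both', 'left_only' or 'right_only'.
outer = pd.merge(df1, df2, on="id", how="outer", indicator=True)
print(outer)
```

Only ids 1 and 2 exist on both sides, so the inner join is the smallest result and the outer join the largest.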

## Transformations
Let's see how to do some easy transformations with pandas Dataframes.  

### Cleaning
We can remove rows with empty values.

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3],
  "first name": ["John", None, "Troy"]
}
df = pd.DataFrame(data)
print(df)
df.dropna()

We can remove duplicated rows

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 2],
  "first name": ["John", "Mike", "Mike"]
}
df = pd.DataFrame(data)
print(df)
df.drop_duplicates()

Or we can remove data we don't need

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3],
  "first name": ["John", "Mike", "Troy"]
}
df = pd.DataFrame(data)
print(df)

# Keep only the rows whose id is not 1
df = df[df["id"] != 1]

df

### Replacing

We can replace empty values

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3, None],
  "first name": ["John", None, "Troy", "Steve"]
}
df = pd.DataFrame(data)
print(df)

df.fillna('NOTHING HERE')

Or better, with a different replacement for each column.

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3, None],
  "first name": ["John", None, "Troy", "Steve"]
}
df = pd.DataFrame(data)
print(df)

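# assign the result back to the column; calling fillna(..., inplace = True)
# on a selected column is chained assignment and may not modify df in recent pandas versions
df["id"] = df["id"].fillna(0)
df["first name"] = df["first name"].fillna("Someone")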
df
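
`fillna` also accepts a dictionary that maps column names to replacement values, which handles several columns in one call and returns a new dataframe. A sketch with the same data (`df_filled` is a name chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
  "id": [1, 2, 3, None],
  "first name": ["John", None, "Troy", "Steve"]
})

# One replacement value per column; the original df is left untouched.
df_filled = df.fillna({"id": 0, "first name": "Someone"})
print(df_filled)
```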

We can transform data.

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3, 4],
  "first name": ["John", "Mike", "Troy", "Steve"],
  "birth_date": ["2020-01-10", "2017-06-20", "2010-11-13", "2020-11-30"]
}

df = pd.DataFrame(data)
print(df.info())
print()
df

In [None]:
df['birth_date'] = pd.to_datetime(df['birth_date'])
print(df.info())
print()
df

Or we can add columns based on existing ones

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3, 4],
  "first name": ["John", "Mike", "Troy", "Steve"],
  "birth_date": ["2020-01-10", "2017-06-20", "2010-11-13", "2020-11-30"]
}
df = pd.DataFrame(data)
print(df)
print()

df['birth_date_day'] = pd.to_datetime(df['birth_date']).dt.day
df['birth_date_month'] = pd.to_datetime(df['birth_date']).dt.month
df['birth_date_year'] = pd.to_datetime(df['birth_date']).dt.year
df['zero'] = 0
df

### Extracting
As you have already seen, you can extract just a few columns in a dataframe by passing the list of the fields you need.

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3, 4],
  "first name": ["John", "Mike", "Troy", "Steve"],
  "birth_date": ["2020-01-10", "2017-06-20", "2010-11-13", "2020-11-30"]
}
df = pd.DataFrame(data)
df[['id', 'birth_date']]

With ``head()`` you can select just the first N rows

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3, 4],
  "first name": ["John", "Mike", "Troy", "Steve"],
  "birth_date": ["2020-01-10", "2017-06-20", "2010-11-13", "2020-11-30"]
}
df = pd.DataFrame(data)
df.head(2)

Or you can pass a condition to obtain just a set of rows

In [None]:
import pandas as pd

data = {
  "id": [1, 2, 3, 4],
  "first name": ["John", "Mike", "Troy", "Steve"],
  "birth_date": ["2020-01-10", "2017-06-20", "2010-11-13", "2020-11-30"]
}
df = pd.DataFrame(data)
df[df["id"] >= 3]
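
Conditions can be combined with `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses. `isin()` checks membership in a list. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
  "id": [1, 2, 3, 4],
  "first name": ["John", "Mike", "Troy", "Steve"],
  "birth_date": ["2020-01-10", "2017-06-20", "2010-11-13", "2020-11-30"]
})

# Rows with id >= 2 AND a birth date on or after 2020-01-01
# (ISO date strings compare correctly as text).
recent = df[(df["id"] >= 2) & (df["birth_date"] >= "2020-01-01")]

# Rows whose name is in a given list.
selected = df[df["first name"].isin(["John", "Troy"])]

print(recent)
print(selected)
```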

### Grouping
Data can be grouped and aggregated.

In [None]:
import pandas as pd

data = {
  "score": [14, 21, 38, 45, 34, 74, 2, 44, 23, 9, 81, 43],
  "player": ["John", "Mike", "Troy", "Steve", "John", "Mike", "Troy", "Steve", "John", "Mike", "Troy", "Steve"],
  "date": ["2020-01-10", "2020-01-10", "2020-01-10", "2020-01-10", 
                 "2020-01-11", "2020-01-11", "2020-01-11", "2020-01-11", 
                 "2020-01-12", "2020-01-12", "2020-01-12", "2020-01-12"]
}
df = pd.DataFrame(data)
print(df)
print()
print(df.groupby("player")["date"].count())
print()
print(df.groupby("player")["score"].sum())

All together using ``numpy`` (see next section).

In [None]:
import numpy as np

df.groupby("player").agg({"date": np.size, "score": np.sum })
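
The same summary can be written without `numpy` by using named aggregation, where each keyword is an output column and each tuple is `(input column, function name)`. The output names `games` and `total_score` are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
  "score": [14, 21, 38, 45, 34, 74, 2, 44, 23, 9, 81, 43],
  "player": ["John", "Mike", "Troy", "Steve"] * 3,
  "date": ["2020-01-10"] * 4 + ["2020-01-11"] * 4 + ["2020-01-12"] * 4
})

# One output column per keyword: (source column, aggregation name).
summary = df.groupby("player").agg(
  games=("date", "count"),
  total_score=("score", "sum")
)
print(summary)
```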

## Pivoting
A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows of data in a table to obtain a desired report. The tool does not actually change the table itself, it simply _pivots_ or turns the data to view it from a **different perspective**.

When creating a pivot table, there are three main components:

- **Columns**: When a field is chosen for the column area, only the unique values of the field are listed across the top.
- **Rows**: When a field is chosen for the row area, it populates as the first column. Similar to the columns, all row labels are the unique values and duplicates are removed.
- **Values**: Each value is kept in a pivot table cell and displays the summarized information. The most common summaries are sum, average, minimum and maximum.


In [None]:
import pandas as pd

# a list of order lines with its user
df = pd.DataFrame(
  {
    'order_id': ['0001', '0001', '0002', '0002', '0002', '0003', '0004', '0004'],
    'user_name': ['john', 'john', 'mike', 'mike', 'mike', 'tony', 'mike', 'emma'],
    'product_id': [1, 2, 2, 4, 6, 2, 1, 6],
    'amount': [1, 1, 2, 4, 1, 1, 3, 2]
   }
)
print('Original')
print(df)
print()
print('Pivots')

# different pivots
df.pivot(columns = 'user_name')

In [None]:
df.pivot(columns = 'user_name', values = 'amount')

In [None]:
df.pivot(columns = 'user_name', values = 'amount', index = 'product_id')
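
`pivot` only reshapes data, so it raises a `ValueError` when two rows share the same index/column pair. `pivot_table` aggregates such rows instead. A sketch that totals the amount per user and product; the data is a shortened, modified version of the table above, with a second `mike` row for product 2 added to create a duplicate:

```python
import pandas as pd

df = pd.DataFrame({
  "user_name": ["john", "john", "mike", "mike", "mike"],
  "product_id": [1, 2, 2, 4, 2],
  "amount": [1, 1, 2, 4, 3]
})

# mike/product 2 appears twice, so plain pivot cannot place both values.
try:
  df.pivot(index="product_id", columns="user_name", values="amount")
  pivot_failed = False
except ValueError:
  pivot_failed = True

# pivot_table sums the duplicates; empty cells are filled with 0.
totals = df.pivot_table(
  index="product_id", columns="user_name", values="amount",
  aggfunc="sum", fill_value=0
)
print(pivot_failed)
print(totals)
```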

## Correlation
**Correlation** measures the interdependence of two quantities, expressed as a float number between ``-1.0`` (perfect negative correlation) and ``1.0`` (perfect positive correlation). Correlation can be calculated in two ways:
- between columns of a dataframe
- between two dataframes

In [None]:
import pandas as pd

df = pd.DataFrame(
  {
    'age': [11, 12, 12, 4, 16, 22, 31, 16, 33, 18, 20, 20, 23, 4, 13],
    'occurrences': [1, 1, 2, 4, 1, 1, 3, 2, 4, 2, 3, 0, 1, 3, 4]
   }
)
print(df.corr(min_periods = 1, method = 'pearson'))
print()
print(df.corr(min_periods = 5, method = 'kendall'))
print()
df.corr(min_periods = 10, method = 'spearman')
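
For reference, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it by hand with `numpy` and compares it with the pandas result for the same data (`r_manual` and `r_pandas` are names chosen here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
  "age": [11, 12, 12, 4, 16, 22, 31, 16, 33, 18, 20, 20, 23, 4, 13],
  "occurrences": [1, 1, 2, 4, 1, 1, 3, 2, 4, 2, 3, 0, 1, 3, 4]
})

x = df["age"].to_numpy(dtype=float)
y = df["occurrences"].to_numpy(dtype=float)

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx and dy the deviations from the mean
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = df["age"].corr(df["occurrences"])  # Pearson by default
print(r_manual, r_pandas)
```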


In [None]:
import pandas as pd

df1 = pd.DataFrame(
  {
    'age': [11, 12, 12, 4, 16, 22, 31, 16, 33, 18, 20, 20, 23, 4, 13],
    'occurrences': [1, 1, 2, 4, 1, 1, 3, 2, 4, 2, 3, 0, 1, 3, 4]
   }
)
df2 = pd.DataFrame(
  {
    'age': [11, 12, 12, 14, 16, 22, 31, 16, 23, 18, 20, 20, 23, 14, 13],
    'occurrences': [3, 1, 1, 2, 4, 1, 1, 3, 2, 4, 2, 3, 0, 1, 3]
   }
)
df1.corrwith(df2, axis = 0)

## Rolling
A rolling window model is used to check the stability of a model, commonly over a time series. We usually establish the size of a **time window**, calculate the mean over it and compare that mean against the actual value.

_(*) In a pandas dataframe, we can calculate rolling without a time index._

_(**) And we can apply different calculations on the window values._

In [None]:
import pandas as pd

df = pd.DataFrame(
  {
    'occurrences': [11, 12, 12,  4, 16, 22, 31, 16, 33, 18,
                    20, 23,  4, 13, 12, 51, 23, 27,  9, 15, 
                    34,  8, 23, 17,  9, 30, 20, 12, 21, 18, 9]
  },
  index = ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06", 
           "2023-01-07", "2023-01-08", "2023-01-09", "2023-01-10", "2023-01-11", "2023-01-12", 
           "2023-01-13", "2023-01-14", "2023-01-15", "2023-01-16", "2023-01-17", "2023-01-18", 
           "2023-01-19", "2023-01-20", "2023-01-21", "2023-01-22", "2023-01-23", "2023-01-24", 
           "2023-01-25", "2023-01-26", "2023-01-27", "2023-01-28", "2023-01-29", "2023-01-30", "2023-01-31"]
)
df['rolling_occ'] = df['occurrences'].rolling(7, min_periods = 1, center = True).mean()
df
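
The two notes above can be seen in a small sketch: the window is a number of rows (no time index needed), and an aggregation other than the mean is applied to it. The values are made up for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Window of 3 rows; the first two positions have too few values and yield NaN.
rolling_mean = s.rolling(3).mean()

# With min_periods = 1 a result is produced as soon as one value is available.
rolling_max = s.rolling(3, min_periods=1).max()

print(rolling_mean.tolist())
print(rolling_max.tolist())
```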

## Plotting
Although there are better alternatives for drawing information in graphs, we can use the ``pandas`` library to plot data.

Visit https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html for further info.

In [None]:
import pandas as pd

df = pd.DataFrame(
  {
    'occurrences': [11, 12, 12,  4, 16, 22, 31, 16, 33, 18,
                    20, 23,  4, 13, 12, 51, 23, 27,  9, 15, 
                    34,  8, 23, 17,  9, 30, 20, 12, 21, 18, 9]
  },
  index = ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06", 
           "2023-01-07", "2023-01-08", "2023-01-09", "2023-01-10", "2023-01-11", "2023-01-12", 
           "2023-01-13", "2023-01-14", "2023-01-15", "2023-01-16", "2023-01-17", "2023-01-18", 
           "2023-01-19", "2023-01-20", "2023-01-21", "2023-01-22", "2023-01-23", "2023-01-24", 
           "2023-01-25", "2023-01-26", "2023-01-27", "2023-01-28", "2023-01-29", "2023-01-30", "2023-01-31"]
)
df.plot.area()
df.plot.line()
df.plot.bar()

# What's next?
Let's learn how to use some libraries:

- NumPy
- SciPy
- Matplotlib
- Seaborn
- statsmodels
- scikit-learn