# Data Sourcing or Getting and cleaning data 

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

## Introduction

To perform any kind of data science or data analysis, we need data. No data means no data science. The data that we extract is generally not in a neat and tidy format, so before any analysis we need to get it ready. In this section, we will go through the following - 
* Finding and extracting raw data.
* Tidy data principles and how to make data tidy - taking the raw data and formatting it into a tidy format that is easy to use.
* Practical implementation using Python


In the real world, data comes in many forms: it could be neatly organized in an Excel file, it could be a text log file from which we need to extract substrings, or it could be JSON from APIs and databases. The data can also live in multiple places or in multiple formats, in which case we have to collect all of it and bring it into a single format that can be used for data analysis.

The basic pipeline for data cleaning is - `Raw Data -> Processing Script -> Tidy Data`

## Raw vs Processed data

Raw data - 
* The original source of data
* Often very hard to use in data analysis
* Data analysis includes data processing/cleaning
* Raw data may only need to be processed once, but regardless of how many times it is processed, we need to keep a record of every processing step performed

Processed data - 
* Data is ready for analysis
* Processing can include merging, subsetting, transforming, etc.
* There may be standards of processing
* **All steps should be recorded** - this is very important
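
As a minimal sketch of the points above, the following example merges, subsets, and transforms two small tables while keeping a written record of each step. The table names, column names, and values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical raw tables; the names and values are made up for illustration.
patients = pd.DataFrame({"patient_id": [1, 2, 3], "age": [34, 51, 29]})
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "weight_lb": [150.0, 152.0, 180.0, 130.0],
})

steps = []  # human-readable record of every processing step

# Merging: attach patient attributes to each visit.
processed = visits.merge(patients, on="patient_id", how="left")
steps.append("merged visits with patients on patient_id (left join)")

# Subsetting: keep only patients aged 30 or older.
processed = processed[processed["age"] >= 30].reset_index(drop=True)
steps.append("kept rows with age >= 30")

# Transforming: convert pounds to kilograms.
processed["weight_kg"] = processed["weight_lb"] * 0.45359237
steps.append("added weight_kg = weight_lb * 0.45359237")

print(processed)
print("\n".join(steps))
```

The `steps` list is only one possible way to keep the record; a processing script kept under version control serves the same purpose.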

## Tidy Data

Tidy data is the data that we actually perform the analysis on. We transform the raw data that we read from various sources into tidy data using data processing methods. After moving from raw data to tidy data, we should have the following -
1. The raw data
2. The tidy data
3. A code book describing each variable in the tidy dataset and its values. It can also be described as the metadata of the tidy data, since it explains the structure of the tidy data
4. An explicit and exact recipe to go from item 1 to items 2 and 3, i.e., the exact sequence of steps performed to get the tidy data and the code book from the raw data

### Raw Data Definition

Raw data is the rawest form of data available, i.e., unformatted and unprocessed data that we got directly from sources such as an API, a database, an Excel sheet, a raw text file, etc. Data is raw if it has the following characteristics - 
* **No software has been executed on the data** - Apart from the software or code executed to obtain the raw data, no other processing has been done on it
* No numbers have been manipulated
* No data has been removed from the dataset
* The data is not summarized in any way

### Tidy Data Definition

This is the target or end data that we want. It should satisfy the following -
1. Each variable measured should be in exactly one column
2. Each different observation should be in a different row
3. There should be one table for each kind of variable
4. If multiple tables are present, each should include a column that allows them to be linked to one another

It is also recommended that -
* The first row of the data set contains the variable names for each column
* The variable names are human readable
* The data is saved in one file per table
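
To make the tidy data rules concrete, here is a small sketch that reshapes a "wide" table into a tidy one with `DataFrame.melt`. The table, its column names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical "wide" raw table: one column per year, so the variable
# "year" is spread across the column headers instead of being a column.
raw = pd.DataFrame({
    "city": ["Baltimore", "Boston"],
    "2019": [120, 95],
    "2020": [110, 90],
})

# Reshape so each variable (city, year, tickets) has exactly one column
# and each observation (one city in one year) has its own row.
tidy = raw.melt(id_vars="city", var_name="year", value_name="tickets")
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["city", "year"]).reset_index(drop=True)

print(tidy)
```

Saving the result with `tidy.to_csv(path, index=False)` writes the variable names as the first row of the file, which matches the recommendations above.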

### The Code Book


The code book should contain the following -
1. Information about the variables (including units) in the data set that is not contained in the tidy data, i.e., metadata about the variables that helps in understanding the data
2. Information about the summary choices made, e.g., whether the mean or the median was taken for a particular column
3. Information about the experimental study design, i.e., the way the data was collected

Some other important points - 
* A common format for this document is a Word, text, or Markdown file
* There should be a section called "Study Design" containing a thorough description of how the data was collected. It should say things like how it was decided which observations to collect, and what was extracted from the database and what was not
* There should be a section called "Code Book" listing all the variables and their data types
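
One possible way to start the "Code Book" section is to generate a skeleton from the tidy table itself. The sketch below assumes a hypothetical tidy table; the descriptions and units cannot be inferred from the data and have to be written by hand.

```python
import pandas as pd

# Hypothetical tidy table used only for illustration.
tidy = pd.DataFrame({
    "city": ["Baltimore", "Boston"],
    "year": [2019, 2019],
    "tickets": [120, 95],
})

# Descriptions (including units) are written by hand, not inferred.
descriptions = {
    "city": "City where the tickets were issued",
    "year": "Calendar year of the count",
    "tickets": "Number of tickets issued (count)",
}

# One row per variable: name, data type, and description.
code_book = pd.DataFrame({
    "variable": list(tidy.columns),
    "dtype": [str(t) for t in tidy.dtypes],
    "description": [descriptions[c] for c in tidy.columns],
})

print(code_book.to_string(index=False))
```

The printed table can be pasted into the text or Markdown file that holds the rest of the code book.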

### The Instruction List

At any time, we should be able to go back to the raw data and reprocess it to get the same tidy dataset. If that does not happen, something is wrong with the data processing pipeline and needs to be fixed.
1. Ideally, a computer script performs this job
2. The input of the script is the raw data
3. The output of the script is the tidy data
4. The script has no parameters - everything is fixed once the initial data processing is done, and nothing in the script is tweaked or modified afterwards. This ensures that the behaviour of the script remains consistent
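
The sketch below illustrates such a script under simplifying assumptions: the raw data is a short hypothetical CSV inlined as a string so that the example runs without external files, and `make_tidy` is a made-up name. The function takes only the raw data as input and has no tuning parameters, so running it twice gives the same tidy table.

```python
import io
import pandas as pd

# Hypothetical raw data, inlined so the sketch runs without external files.
RAW_CSV = "Name,Score\nalice,90\nbob,\ncarol,85\n"

def make_tidy(raw_text):
    """Turn the raw CSV text into the tidy table. No tuning parameters."""
    df = pd.read_csv(io.StringIO(raw_text))
    # Fixed step 1: lower-case the variable names.
    df.columns = [c.strip().lower() for c in df.columns]
    # Fixed step 2: drop observations with a missing score.
    df = df.dropna(subset=["score"]).reset_index(drop=True)
    return df

first = make_tidy(RAW_CSV)
second = make_tidy(RAW_CSV)

# Reprocessing the same raw data must give the same tidy data.
print(first.equals(second))
print(first)
```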

In some cases it is not possible to capture every step in a script, and some manual operations need to be performed. In that case, written instructions should be provided, for example - 
1. Step 1 - Take the raw file, select the version of the software to be executed, and set the parameter values (if any)
2. Step 2 - Execute the software separately for each sample
3. Step 3 - Note which column of the software's output corresponds to which column in the final data

While writing these instructions, completeness is crucial: any information that tells the audience how the data was processed is beneficial.

## Downloading Data

Generally, we need to download the data from some location before we can use it.

### Checking and creating directory

Before downloading data, we need a folder in which to keep all our data. We check whether the data folder is present and, if it is not, create it. `os.path.isdir('data')` returns `True` if the folder exists and `False` if it does not. `os.mkdir('data')` creates the folder; it raises `FileExistsError` if the folder already exists, which is why we check first.
```python
import os
if not os.path.isdir('data'):
    os.mkdir('data')
```
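
As an alternative sketch, `os.makedirs` with `exist_ok=True` combines the check and the creation in a single call:

```python
import os

# Creates 'data' (and any missing parent folders);
# does nothing if the folder already exists.
os.makedirs('data', exist_ok=True)

print(os.path.isdir('data'))
```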

### Downloading a file from the web

Now that we have created a data directory, the next job is to get the data from the internet. We will use the [Baltimore Fixed Speed Cameras](https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru) dataset for this example. Refer to the image below to get the download URL.

<img src="images/data_science-baltimore_dataset.png" width="80%">

```python
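import requests

fileUrl = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
r = requests.get(fileUrl)
r.raise_for_status()  # stop here if the server returned an error status
# Write the raw bytes so the file on disk is exactly what the server sent
with open('./data/cameras.csv', 'wb') as f:
    f.write(r.content)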
```

We can list the files in the directory to check whether the download has completed -
```python
import os
os.listdir('data')
```

We can also record the date and time of the download so that we know when the data file was downloaded -
```python
from datetime import datetime
dateDownloaded = datetime.now()
print(dateDownloaded)
```

Downloading might take some time if the file is large. Also, make sure to record when the file was downloaded.
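
One way to keep that record is to write a small metadata file next to the downloaded file. The sketch below is an assumption, not an established convention: the helper `record_download`, the `.meta.json` suffix, and the example URL are all hypothetical, and the demonstration uses a placeholder file in a temporary folder so that it runs without network access.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def record_download(data_path, url):
    """Write a JSON file next to data_path noting the source and time."""
    meta = {
        "file": os.path.basename(data_path),
        "url": url,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = data_path + ".meta.json"  # hypothetical naming scheme
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta_path

# Demonstration with a placeholder file in a temporary folder.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "cameras.csv")
with open(demo_file, "w") as f:
    f.write("placeholder\n")

meta_path = record_download(demo_file, "https://example.com/cameras.csv")
with open(meta_path) as f:
    print(f.read())
```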

## Reading local flat files

[Flat files](https://searchsqlserver.techtarget.com/definition/flat-file) are files that do not have any inherent structural interrelationship.

Generally, data is available in the following flat file formats - 
* CSV
* Excel
* XML
* JSON

Refer to the [Reading Data](http://localhost:8888/lab/tree/Learning/Data%20analysis/Reading%20data.ipynb) notebook for more information on how to read, write, and manipulate data in these file formats.
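
As a quick illustration for two of these formats, the sketch below reads the same small table from CSV and from JSON with pandas. The content is hypothetical and held in strings so that the example runs without any files; with real files, the file path is passed instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical content; with real data these would be files on disk.
csv_text = "street,direction\nMain St,north\nOak Ave,south\n"
json_text = (
    '[{"street": "Main St", "direction": "north"},'
    ' {"street": "Oak Ave", "direction": "south"}]'
)

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```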

## Reading from data storages

Apart from local flat files, data is often kept in [data storages](https://en.wikipedia.org/wiki/Data_storage) that we need to read from. Some of them are as follows -
* MySQL
* HDF5
* Web
* API
* Other sources such as other databases; compressed files (like gz or bz); other software such as SAS, SPSS, etc.; images like jpeg, png; GIS data; music data; etc.

Refer to the [Reading Data](http://localhost:8888/lab/tree/Learning/Data%20analysis/Reading%20data.ipynb) notebook for more information on how to read, write, and manipulate data from these data storages.
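
As a rough sketch of reading from a database, the example below uses an in-memory SQLite database from the Python standard library as a stand-in for a server such as MySQL, which would need a running server and a driver. The table name, columns, and rows are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used as a stand-in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cameras (id INTEGER PRIMARY KEY, street TEXT)")
conn.executemany(
    "INSERT INTO cameras (street) VALUES (?)",
    [("Main St",), ("Oak Ave",)],
)
conn.commit()

# Run a query and load the result directly into a DataFrame.
cameras = pd.read_sql_query("SELECT id, street FROM cameras ORDER BY id", conn)
conn.close()

print(cameras)
```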

## Data Cleaning

Refer to the [Data analysis notebook](http://0.0.0.0:8888/lab/tree/Learning/Data%20analysis/Data%20cleaning%20and%20visualization.ipynb) to see how we can perform data cleaning.