**Note**: this document includes explanations and code, but it is not designed to stand alone. These are the notes with which the instructor will facilitate the workshop, and they are incomplete without further modifications, clarifications, and explanations. Use them to familiarise yourself with some of the notions and with how the code looks. You will create your own document as you follow along on your own machine and type your own code on the workshop days. If some bits don't work, don't worry: they are there on purpose and you'll learn why.

<hr>

# Data Wrangling in Python

## Overview (what to expect)

### Data Wrangling

<ul>
    <li><h4>Part One: Getting Ready</h4></li>
    <ul>
        <li>Understanding what "data" and data "wrangling" are</li>
        <li>Importing libraries, reading files, and pandas dataframes</li>
        <li>Writing python: assignments, variables, properties, and methods</li>
        <li>A word on "types": python types and pandas types</li>
        <li>Recapitulation</li>
    </ul><br>
    <li><h4>Part Two: Data Preparation</h4></li>
    <ul>
        <li>Exploring the data: basic insights</li>
        <li>Working with missing data: errors, omissions, variations</li>
        <li>Data types</li>
        <li>Comparing data: finding a standard</li>
        <li>Data Binning: rough grouping</li>
        <li>Dummy variables: from categories to numbers</li>        
    </ul>
</ul>

## What is "Data"?

The idea of measurement: anything "measured" is measured *by someone and by something* and *for someone and for something*.

What one obtains after measuring is "given" for a subsequent use or goal.

This "given" is what "data" means: *datum* (Latin) means "what is given": **data is what is "given"; it is what we obtain when measuring something that is subsequently given for a particular use or purpose**.

There is no out-of-nowhere "data": it is always a perspective on something. It is already oriented towards something by being measured, and its meaning is further shaped by what it will be used for.

Example: measuring a plot of land.

![Medir-un-terreno.jpeg](attachment:Medir-un-terreno.jpeg)
from: https://www.jcs.pe/cuando-medir-un-terreno-con-topografo/medir-un-terreno/


## What is "Data Wrangling"?

Data wrangling = data cleaning = data pre-processing = data preparation

Data wrangling is the *curating* of data: it is the preparation of what was measured so it is accessible.

It is a way of adjusting post-facto (post-measuring) the measuring procedure, with a clearer view of the purpose or use of what was measured.

Accessible = it can be handled, ordered, investigated, presented

Example: Pablo the florist.

![image.png](attachment:image.png)

from: https://www.istockphoto.com/photo/male-florist-working-at-the-flower-shop-gm1184231042-333268215

**Doing data wrangling means cutting, altering, modifying, *preparing* the measurements we have been given (i.e., the "data") with a view to subsequently ordering, arranging, and presenting them in a particular way.**

Itinerary:

Monday: data curation and preparation

Tuesday: data presentation (visually)

With Pablo (the florist) we learn how to prepare the flowers (Monday) and how to arrange them and present them (Tuesday).

## Preparing our instruments: importing a library, reading a file

### What is a library?

A library is a storehouse of pre-programmed methods that we can use. Don't worry about the notation, or about not knowing exactly how what you're doing works: it becomes clearer through repeated use.

In [None]:
import numpy as np
import pandas as pd

The format is <code>import</code> *name of library* <code>as</code> *shortened name you'll be using in your code*.

So <code>import numpy as np</code> means: import the library called "numpy" into Python, and in my code I'll use the name "np" to call its methods and properties.
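
As a minimal sketch of how the alias is used once the library is imported (the list of numbers is invented purely for illustration):

```python
import numpy as np

# A made-up list of values, purely for illustration
values = [2, 4, 6, 8]

# "np" is the alias we chose; np.mean is numpy's averaging function
average = np.mean(values)
print(average)  # 5.0
```

Had we written <code>import numpy</code> without an alias, the same call would be <code>numpy.mean(values)</code>.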

Now let us read a file:

In [None]:
df_elections = pd.read_csv('Election.Results.csv')

Here we are creating a variable and assigning something to it. We'll talk more of variables and assignments below.

The command <code>pd.read_csv()</code> reads a .csv file from a given address.

We can read other types of files with analogous instructions: <code>pd.read_excel()</code> to read Excel files, <code>pd.read_json()</code> to read JSON files, etc.
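
To try <code>pd.read_csv()</code> without depending on any file on disk, here is a small sketch that reads CSV text held in memory. The CSV content and the name <code>df_example</code> are assumptions made up for this illustration; they are not part of the workshop data.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file on disk
csv_text = "town,population\nAlpha,1200\nBeta,5300\n"

# pd.read_csv accepts a file path or a file-like object such as StringIO
df_example = pd.read_csv(io.StringIO(csv_text))

# Two rows of data and two columns were read
print(df_example.shape)  # (2, 2)
```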

Let us now see some of the data:

In [None]:
df_elections.head()

In [None]:
df_elections.tail()

### On Dataframes in Pandas

Pandas stores the data files it reads as dataframes. This means that we need to learn to use the data we read in the "dataframe" mode. We will thus learn to use methods and properties from the dataframe repertoire.

### Writing (to) Python

Before moving on to exploring pandas dataframes, we will talk about how to write/code in python well. For this we have to know some of the main processes we will be performing, such as *assigning values* to a *variable*, using the *properties* or *methods* of a given library.

Let us quickly get an introduction to these ideas.

### What is an assignment?

It is what the "=" sign stands for: it has a direction from right to left (←). It says "what is on my right I assign to what is on my left." For example: <code>df_test = pd.read_csv('test_file.csv')</code> is a statement that assigns the contents of <code>test_file.csv</code>, read as a dataframe, to the variable <code>df_test</code>. But what is a variable?

### What is a variable?

A variable is a way of locating something with a name that Python can recognise. It is a way of allocating a localisable (hence the name) space to a value, file, or object, so we can reference it with the name given.
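
Assignment and variables can be sketched together in a few lines. The names and numbers here are invented for illustration only.

```python
# The value on the right of "=" is assigned to the name on the left
town_name = "Aberavon"   # a string value
population = 12000       # an integer value (made-up number)

# Assigning again replaces what the name refers to
population = population + 500

print(town_name, population)  # Aberavon 12500
```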

### What is a property?

A property or an attribute is a characteristic of objects. We will use them in particular with pandas, numpy, matplotlib, and other libraries. They normally look like <code>name_of_object</code> <code>.</code> <code>name_of_property</code> (the dot is crucial). An example would be <code>df_test.shape</code>, where "df_test" would be a pandas dataframe.

### What is a method?

Finally, a method looks very much like a property or attribute, but it has parentheses "()". They may be empty or they may have something inside. We have been using the method <code>.read_csv('file_path/file_name.csv')</code> to read files. As you can see, this method does need a parameter inside. An example of a method that does not need parameters (although you may introduce them) is <code>.head()</code>.
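
The difference between a property and a method may be easier to see side by side. This is only a sketch: the small dataframe is made up here and stands in for one read from a file.

```python
import pandas as pd

# A small made-up dataframe, standing in for one read from a file
df_test = pd.DataFrame({"town": ["Alpha", "Beta", "Gamma"],
                        "population": [1200, 5300, 800]})

# Property: no parentheses, it simply reports a characteristic
n_rows, n_cols = df_test.shape

# Method: parentheses, optionally with a parameter inside
first_rows = df_test.head(2)  # only the first two rows

print(n_rows, n_cols, len(first_rows))  # 3 2 2
```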

### Example: Finding all the "small towns"

Let us have a look at our population file and apply the notions we have just learnt.

In [None]:
# df_population = pd.read_csv('lauth-classification-csv.csv') 
df_population = pd.read_csv('data/lauth-classification-csv.csv')

oops.. something went wrong... 

### A word on paths and how to manage them

At times it may be confusing to call a file and read it, especially if we don't have the file we are working on (say this jupyter notebook) in the same folder as the file we want to read.

But don't despair: the "os" library may help us. Let us import it and print our current working directory (cwd).

```python
import os
os.getcwd()
```
*Note*: we don't use <code>import os as</code> <code>alias</code> (a shortened alias) because "os" is already short and handy. Unless we are using "os" as an alias of something else, we can stick to its "full name."

In [None]:
import os
os.getcwd()

In [None]:
#to get the list of files in our working directory

os.listdir(os.getcwd())
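
One possible way to avoid typing paths by hand is to build them from the working directory and check them before reading. In this sketch the folder name <code>data</code> is hypothetical; replace it with wherever you saved the files.

```python
import os

# Hypothetical folder and file names; replace them with your own
folder = "data"
file_name = "lauth-classification-csv.csv"

# os.path.join builds the path with the right separator for the operating system
file_path = os.path.join(os.getcwd(), folder, file_name)

# os.path.exists tells us whether the file is really there before we try to read it
if os.path.exists(file_path):
    print("Found:", file_path)
else:
    print("Not found:", file_path)
```

If the file is found, <code>file_path</code> can be passed directly to <code>pd.read_csv()</code>.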

In [None]:
# example of fixing the path and then working with the data
# This path is an example - each individual will have their own distinct path to fetch the files,
# or they'll need no path but the file name if they save the data sets in the same folder as the jupyter notebook.

#df_population = pd.read_csv('ds/ts/lauth-classification-csv.csv')

df_population = pd.read_csv('lauth-classification-csv.csv')

df_population.head()

In [None]:
df_population.columns

In [None]:
c_head = ["local-authority-code", "local-authority-name", "classification", "population", "percentage-local-authority"]

df_population.columns = c_head

df_population.columns

In [None]:
# To change one or more column names, but not all

df_population.rename(columns={'local-authority-code':'la-code', 'local-authority-name':'la-name'}, inplace = True)

df_population.columns

In [None]:
# Targeting a column with a particular condition

df_population[df_population['classification'] == 'Small Town']

A note on == and other comparison operators like !=, <=, <, >=, >: these operators return *boolean* values.

A boolean value is a type that is either True or False.

For example: 

``` 
    a = 10
    b = 10
    print(a >= b)
```

The code above prints the boolean value of a being greater than or equal to b (which is True).
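
A few more comparisons, sketched with made-up numbers, show how each operator yields True or False:

```python
a = 10
b = 7

# Each comparison produces a boolean: True or False
print(a == b)   # False
print(a != b)   # True
print(a > b)    # True
print(a <= b)   # False

# A boolean can be stored in a variable like any other value
is_larger = a > b
print(type(is_larger))  # <class 'bool'>
```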

In [None]:
## Adding more conditions: all the small towns with more than ten thousand people

df_population[(df_population['classification'] == 'Small Town') & (df_population['population'] > 10000)]

# A note on conditions: "and", "or" is the normal Python way. In Pandas we use "&", "|" (vertical line).
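
To see what "&" and "|" do row by row, here is a sketch on a tiny made-up dataframe. It reuses two column names from the workshop file, but the values are invented.

```python
import pandas as pd

# Made-up data using two of the column names from the workshop file
df_demo = pd.DataFrame({
    "classification": ["Small Town", "Small Town", "City", "Village"],
    "population": [15000, 4000, 250000, 900],
})

# Each comparison yields a column (Series) of True/False values, one per row
is_small_town = df_demo["classification"] == "Small Town"
is_large = df_demo["population"] > 10000

# "&" keeps rows where both are True; "|" keeps rows where at least one is True
both = df_demo[is_small_town & is_large]
either = df_demo[is_small_town | is_large]

print(len(both), len(either))  # 1 3
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because Python evaluates "&" and "|" before "==" and ">".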

In [None]:
# Let's restore the first two column names to their original "local-authority-" form

df_population.rename(columns = {'la-code':'local-authority-code', 'la-name':'local-authority-name'}, inplace = True)

# Careful with "inplace = True": it changes the dataframe itself, and the change cannot be undone

df_population.head()
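
A more cautious alternative to <code>inplace = True</code> is to keep the result in a new variable. The sketch below uses a made-up dataframe; <code>df_demo</code> and <code>df_renamed</code> are names chosen for this illustration.

```python
import pandas as pd

# Made-up values, with column names like those used above
df_demo = pd.DataFrame({"la-code": ["X1", "X2"], "la-name": ["Alpha", "Beta"]})

# Without inplace=True, rename returns a new dataframe and leaves the original untouched
df_renamed = df_demo.rename(columns={"la-code": "local-authority-code"})

print(list(df_demo.columns))     # ['la-code', 'la-name']
print(list(df_renamed.columns))  # ['local-authority-code', 'la-name']
```

This way the original is still available if the change turns out to be a mistake.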

## Exploring the Data

Getting basic insights from the data. Some of the instructions we have already seen. Let's have a look at some new ones and what they do.

In [None]:
df_elections.info(verbose = False)

In [None]:
df_elections.info()

In [None]:
type(df_elections.columns)

In [None]:
type(df_elections.columns.values)

In [None]:
df_elections.index

In [None]:
df_elections.shape

In [None]:
#To see the total of null elements in each column

df_elections.isnull().sum()
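
The instruction above chains two steps, which a small made-up dataframe can make visible. The column names mimic party columns, but the numbers are invented.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values (np.nan marks a missing entry)
df_demo = pd.DataFrame({
    "con": [100.0, np.nan, 300.0],
    "lab": [np.nan, 50.0, 75.0],
    "ld": [10.0, 20.0, 30.0],
})

# isnull() gives True/False for every cell; sum() counts the True values per column
null_counts = df_demo.isnull().sum()

print(int(null_counts["con"]), int(null_counts["lab"]), int(null_counts["ld"]))  # 1 1 0
```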

In [None]:
df_elections.describe()

Note that it includes only the numerical columns. If we say ```df_elections.describe(include = 'all')``` it will also include the columns of type ```object```.

Talking about types: the usual Python types ```int```, ```float```, and ```str``` typically appear in pandas as ```int64```, ```float64```, and ```object```.
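
To compare the two families of types, this sketch prints Python's type for single values and pandas' type for each column of a made-up dataframe. The exact label pandas shows for text columns may depend on the version installed, so no expected output is given for it.

```python
import pandas as pd

# Made-up values, one column per kind of data
df_demo = pd.DataFrame({
    "votes": [120, 340],       # whole numbers
    "share": [0.26, 0.74],     # decimal numbers
    "party": ["Con", "Lab"],   # text
})

# Python's own types for single values
print(type(120), type(0.26), type("Con"))

# pandas' types, one per column
print(df_demo.dtypes)
```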

In [None]:
# We can also describe a portion of columns by df_elections[list of columns here]
# list_col = ['column1', 'column2']
#df_elections[list_col].describe()

In [None]:
df_elections.head(2)

In [None]:
#change index to one of the columns

df_elections.set_index('ons_id', inplace = True)

In [None]:
df_elections.head()

In [None]:
df_elections.reset_index()

In [None]:
df_elections.set_index('constituency_name', inplace = True)

In [None]:
df_elections.head()

In [None]:
df_elections.loc['Aberavon']
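
Setting an index and then selecting by label with <code>.loc</code> can be sketched as follows. The figures are made up, and only the column names mimic the election file.

```python
import pandas as pd

# Made-up figures; the column names mimic those in the election file
df_demo = pd.DataFrame({
    "constituency_name": ["Aberavon", "Aberdeen South"],
    "con": [100, 200],
    "lab": [300, 150],
})

# Without inplace=True, set_index returns a new dataframe with the chosen column as row labels
df_by_name = df_demo.set_index("constituency_name")

# .loc selects by label: first the row label, then (optionally) the column label
row = df_by_name.loc["Aberavon"]
value = df_by_name.loc["Aberavon", "lab"]

print(value)  # 300
```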

In [None]:
parties = ['con', 'lab', 'ld', 'ukip', 'green', 'snp', 'pc', 'dup', 'sf', 'sdlp', 'uup', 'alliance', 'other']
areas = ['Aberavon', 'Aberdeen South']
df_elections.loc[areas, parties].plot()

...something looks weird.. what happened?

In [None]:
#Plotting a dataframe needs the horizontal values of the plot to be on the vertical of the dataframe

df_comparison = df_elections.loc[areas, parties]
df_comparison.transpose().plot(kind='line', figsize=(14,8))
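
What <code>transpose()</code> does can be checked without plotting anything. A line plot of a dataframe uses the index for the horizontal axis and draws one line per column, so swapping rows and columns changes what ends up on each. The figures below are made up.

```python
import pandas as pd

# Made-up figures for two areas and three parties
df_demo = pd.DataFrame(
    {"con": [100, 200], "lab": [300, 150], "ld": [50, 60]},
    index=["Aberavon", "Aberdeen South"],
)

# transpose() swaps rows and columns: parties become the index, areas become the columns
df_flipped = df_demo.transpose()

print(df_demo.shape, df_flipped.shape)  # (2, 3) (3, 2)
print(list(df_flipped.index))           # ['con', 'lab', 'ld']
```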

### Dummy variables: from categories to numbers

In [None]:
df_elections.head(10)

Dummy variables allow us to turn categories, which are usually of type ```string``` or ```object```, into numerical categories.

In [None]:
dummy1 = pd.get_dummies(df_elections['constituency_type'])
dummy1.head(20)

In [None]:
dummy1 = pd.get_dummies(df_elections['result'])
dummy1.head(20)

In [None]:
dummy1.columns

In [None]:
df_elections.columns.values

In [None]:
dummy2 = pd.get_dummies(df_elections['first_party'])
dummy2.head(10)

In [None]:
#dummy2['Con'].sum()

In [None]:
#dummy2['LD'].sum()

In [None]:
#dummy2['Lab'].sum()

The dummy columns may then be joined to the main dataframe, where they can replace the original categorical column.
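
One possible way to do this, sketched on a made-up dataframe, is to use <code>pd.concat</code> to join and <code>drop</code> to remove the original column. This is an illustration under those assumptions, not necessarily the approach followed in the workshop.

```python
import pandas as pd

# Made-up data; "first_party" mimics a column name from the election file
df_demo = pd.DataFrame({
    "constituency_name": ["A", "B", "C", "D"],
    "first_party": ["Con", "Lab", "Con", "LD"],
})

# One new column per category, marking the rows that belong to it
dummies = pd.get_dummies(df_demo["first_party"])

# Join the dummy columns to the main dataframe side by side (axis=1)
df_joined = pd.concat([df_demo, dummies], axis=1)

# Drop the original categorical column now that the dummies stand in for it
df_joined = df_joined.drop(columns="first_party")

print(list(df_joined.columns))       # ['constituency_name', 'Con', 'LD', 'Lab']
print(int(df_joined["Con"].sum()))   # 2
```

Summing a dummy column counts how many rows fall in that category.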