# Data science: a brief introduction

## Some definitions

**Data science:**
*An interdisciplinary field that extracts knowledge and insight from data*

**Data journalism:**
*An interdisciplinary field that combines data science principles with journalism to tell stories with data*

## Agenda

**We'll start with a crash course in Python basics and learn about:**
1. The print command
2. Data types
3. Variables
4. Lists
5. Loops
6. Conditional logic
7. Functions
8. Libraries

**Then, we'll use our skills to clean a dataset**

**Lastly, we'll make a graph**


## 1. The Print Command
Using the Jupyter notebook, we can write lines of code and receive feedback. Let's try it out!

In [None]:
print('Hello world!')

Above, we have a simple command which will print whatever is inside the <span style = "color:red">**( )**</span>. Let's try running it by pressing SHIFT + ENTER.

<hr>
Notice that inside the **( )**, we use <span style="color:red">**' '**</span> to surround our words. What happens when we don't use them? Let's try it out below.

In [None]:
print(37)

In [None]:
print(Hello Danwatch!)

<hr>
Whole numbers like **37** are called integers.

Text made of one or more characters, such as a word or a sentence, is called a **string**. To let Python know we're trying to print words, we need to use **' '** or **" "** to enclose our words. Let's fix our example above.

<hr>

## 2. Datatypes in Python
Common datatypes are:
* **str** : (Strings) these are words, like "hello world"
* **int** : (Integer) these are integers, like 1, 2, 3, 4, and 5
* **float** : (Floating-point numbers) these are numbers with a decimal point, like 1.25, 4.56, 2.34
* **list** : (Lists) these are lists of items such as \[ 1, 'apple', 30, 2.23 \] or \[ 'rubber', 'cocoa', 'plastic', 'wood' \]
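
If you're ever unsure which datatype a value has, Python's built-in `type()` function will tell you. Here is a small sketch; the variable names and values are made up for illustration only.

```python
# Hypothetical values, chosen only to show one example of each datatype.
word = "hello world"
whole_number = 37
decimal_number = 1.25
items = [1, "apple", 30, 2.23]

print(type(word))            # <class 'str'>
print(type(whole_number))    # <class 'int'>
print(type(decimal_number))  # <class 'float'>
print(type(items))           # <class 'list'>
```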

<hr>

## 3. Variables
Variables store information in Python. For example, take **a** and **b** in the example below:

In [None]:
a = 1
b = 2

Let's print the variable **a**.

In [None]:
print()

<hr>

***Variables store data so we can process it.*** This is what allows us to transform our data and display it.

The variable name is the part before the **=**. The value after the **=** is stored.

`a = 1` can be read as "`a` is set to `1`". We avoid saying `a` is **equal** to `1`, because in programming "equal" refers to a comparison, which Python writes as `==`.
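
To see the difference, here is a short sketch that sets a variable with `=` and then compares it with `==`. A comparison gives back either `True` or `False`.

```python
a = 1          # assignment: store the value 1 under the name a

print(a == 1)  # comparison: is a equal to 1? prints True
print(a == 2)  # comparison: is a equal to 2? prints False
```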

Let's start by trying some addition. What happens when you add **a** to **b**? Let's save it as a new variable called **c**.

In [None]:
c = 

print()

<hr>
You can also add strings together.

I've stored two strings as variables below. What happens when you add the variables together?

In [None]:
d = "Hello darkness"
e = "my old friend"

new_string =

print()

Spaces are also characters. Let's fix the sentence by adding a space.
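
Here is the same idea as a small sketch with two different strings. The variable names and words are made up for illustration; the point is that a space is just another string, `" "`, that we can add in between.

```python
# Hypothetical strings, not the ones from the exercise above.
first_word = "data"
second_word = "journalism"

no_space = first_word + second_word
with_space = first_word + " " + second_word

print(no_space)    # datajournalism
print(with_space)  # data journalism
```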

<hr>

## Performing operations

We just added two strings together. We can also do the same with numbers. For numbers, we can also multiply, subtract, and divide.

It's easy to do: <span style="color:red">**new_string = d + e**</span>.

But what happens when we want to do the same thing 50 times? Or 100 times? For example, what if we have a spreadsheet and want to go down one column, doing the same operation over and over again? It would take a very long time.

Luckily, we can use <span style="color:blue">**lists**</span> and <span style="color:blue">**loop**</span> through them.

<hr>

## 4. Lists
Earlier, we learned about a datatype called a "list". Lists are important because we can think of each column of a spreadsheet or table as a list.

![Screen%20Shot%202020-04-27%20at%2011.36.26%20PM.png](attachment:Screen%20Shot%202020-04-27%20at%2011.36.26%20PM.png)
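
In code, a list is written with square brackets. The sketch below uses a made-up list called `materials` (the name and the added item are only for illustration) to show how to look at one item, count the items, and add a new one.

```python
# A hypothetical list of strings, like one column of a spreadsheet.
materials = ["rubber", "cocoa", "plastic", "wood"]

print(materials[0])    # the first item (counting starts at 0): rubber
print(len(materials))  # how many items are in the list: 4

materials.append("cotton")  # add a new item to the end of the list
print(materials)
```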

<hr>

## 5. Loops
Loops are a way to go through a list, one item at a time, and apply the same change. We won't learn about loops in detail, but let's see how a loop works.
![Screen%20Shot%202020-04-28%20at%201.38.59%20PM.png](attachment:Screen%20Shot%202020-04-28%20at%201.38.59%20PM.png)

Instead of changing the items one at a time, we can tell the program to go through all the items in a list and apply the same change.
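
Here is a minimal sketch of a loop in Python. The list `amounts` and its numbers are made up; the loop visits each item in turn, doubles it, and saves the result in a new list.

```python
# Hypothetical numbers, standing in for one column of a spreadsheet.
amounts = [10, 20, 30]
doubled = []

for amount in amounts:          # go through the list, one item at a time
    doubled.append(amount * 2)  # apply the same change to each item

print(doubled)  # [20, 40, 60]
```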
<hr>

## 6. Conditional logic

The other really powerful building block of programming is conditional logic. Here is a basic conditional statement:

<span style = "color:blue">**If the light is red, stop the car.**</span>

Conditional logic says: if the "if" part of the statement is **TRUE**, do the second part of the statement.

<hr>

## Conditional logic, part 2
We can also combine more than one condition:

* <span style="color:blue">**IF**</span> the light is red --> **STOP** the car.

* Or <span style="color:blue">**ELSE IF**</span> the light is green --> **DRIVE**.

* Or <span style="color:blue">**ELSE**</span> --> **SLOW DOWN**.

*Note: In Python, "else if" is abbreviated as `elif`.*
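
Written as Python, the traffic light example could look like the sketch below. The variable names `light` and `action` are made up for illustration. Python checks the conditions from top to bottom and runs only the first branch that is true.

```python
light = "green"  # try changing this to "red" or "yellow"

if light == "red":
    action = "STOP"
elif light == "green":
    action = "DRIVE"
else:
    action = "SLOW DOWN"

print(action)  # DRIVE
```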

<hr>

## 7. Functions

A function is a way to ***use a piece of code over and over again***. 

It's a way to store a few commands that you can call up again and again.

In [None]:
def addition(first_item, second_item):
    return first_item + second_item

In [None]:
addition(1,2)

<hr>

## 8. Libraries

A library is a ***set of pre-defined functions and bits of code*** that make our lives easier.

Instead of writing everything from scratch, we can just use a library to perform an action.

Below, we will import a data analysis library called **pandas**. We'll use this in the next part of the training to open a spreadsheet.

In [None]:
import pandas as pd        # We gave the library the nickname 'pd' so it's easier to refer to.
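
The same pattern works for other libraries too. As a small sketch, here we import Python's built-in `statistics` library under the nickname `st` (the nickname and the numbers are just examples) and use one of its functions.

```python
import statistics as st  # 'st' is a nickname, just like 'pd' for pandas

numbers = [2, 4, 6]      # hypothetical numbers

# mean() is a function that lives inside the statistics library,
# so we call it as st.mean().
print(st.mean(numbers))  # 4
```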

<hr>
<hr>

# Part Two

Now that we've had a very basic crash course in Python, let's see what we can do with it. We'll work on:

* Cleaning our data

* Making a line chart

<hr>

## Why do we need to clean data?

Our data is often in spreadsheets, and these frequently contain spelling mistakes, missing data, inconsistencies, and missing spaces.

Or, sometimes, we just want to make the same change to an entire row or column. If you use Excel or Google Sheets, this can take a long time, right?

Even if you don't want to make visualizations, **cleaning your data with Python can make the work a lot faster**.

<hr>

## Denmark's external trade in organic products

Today, we'll work with some data from [Statistics Denmark](https://www.statbank.dk/).

The data covers Danish imports AND exports with Africa, Asia, Europe, North and South America, and Oceania from 2003 to 2018.
![Screen%20Shot%202020-04-28%20at%203.43.49%20PM.png](attachment:Screen%20Shot%202020-04-28%20at%203.43.49%20PM.png)

<hr>
Let's start by importing a library that will help us clean our data.

In [None]:
import pandas as pd

<hr>

## Comma-Separated Values
Now, let's open up our data file, **stats.csv**.

CSV stands for: **C**omma-**S**eparated **V**alues.

This refers to how the information is stored: each row of the table is one line of text, and the values in that row are separated by commas.
![Screen%20Shot%202020-04-28%20at%207.21.12%20PM.png](attachment:Screen%20Shot%202020-04-28%20at%207.21.12%20PM.png)
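
To make this concrete, here is a small sketch that builds a tiny CSV as plain text and reads it with pandas. The column names match the ones we use later in this notebook, but the rows and numbers are made up, and `io.StringIO` is only used here so that pandas can read the text as if it were a file.

```python
import io
import pandas as pd

# A hypothetical CSV: the first line holds the column names,
# and every following line is one row of values separated by commas.
csv_text = """Trade partner,Type,Year,Amount
ASIA,Import,2003,100
ASIA,Export,2003,40
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (2, 4) means 2 rows and 4 columns
print(list(toy.columns))  # ['Trade partner', 'Type', 'Year', 'Amount']
```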

<hr>

## Read our csv file into the program

Let's read our file into the program so we can work with it.

We'll store the file in a variable called **stats**.

In [None]:
stats = pd.read_csv('stats.csv')

We're familiar with the first part of this ( **stats =** ). But, what is the rest of it doing?

* <span style='color:blue'>**pd**</span> refers to the **pandas** library, which contains code that helps us do data analysis.
* <span style='color:blue'>**read_csv( )**</span> is a function in the pandas library which "reads" our CSV file into the program.

Once our file is read into the program, we will refer to it as a **dataframe**.

<hr>

## Functions (a quick review)

Remember, **a function is a way to use a piece of code over and over again**. 

It's a way to store a few commands that you can call up again and again. For example, here is a function, **<span style="color:blue">addition( )</span>**.

In [None]:
def addition(first_item, second_item):
    return first_item + second_item

addition(1, 3)

<hr>

## Use `info()` to get information about our dataframe
Let's get some information about our dataframe.

We'll use the function <span style="color:blue">**info( )**</span> to do this.

In [None]:
stats

<hr>

## Use `head()` to inspect the first few lines of your dataframe
Let's take a look at the first 5 lines of our dataframe.

We'll use the function <span style='color:blue'>**head( )**</span> to do this.

In [None]:
stats

Awesome! Our data is properly loaded into the program.

What are some things you notice about the data? Is there any way to make it nicer?

<hr>

One thing I noticed was how the Trade partner is written as **'EUROPE, TOTAL'**.

This would look nicer if it was just **'EUROPE'**.

Let's take a peek at the unique values in the 'Trade partner' column.

<hr>

## Select a column within the dataframe
We'll select this column by typing <span style="color:blue">**stats['Trade partner']**</span>

In English, we can understand this means:

**for the <span style="color:blue">stats</span> dataframe, select the <span style="color:blue">Trade partner</span> column**

In [None]:
stats['Trade partner']

<hr>

Now, let's take a look at the **unique** values for this column.

We will use the function <span style="color:blue">**unique( )**</span>.

In [None]:
stats['Trade partner']

<hr>

## Use `str.replace()` to replace characters in a string

Let's fix this! We'll use a function called **<span style="color:blue">str.replace(</span> 'original', 'new' <span style="color:blue">)</span>**

We want to replace **', TOTAL'** with **''**.

In [None]:
stats['Trade partner'] = stats['Trade partner'].str.replace()

Let's see if it worked!

We will ask for the unique values using the **<span style="color:blue">unique( )</span>** function.

In [None]:
stats['Trade partner']

It worked! But, now I've noticed something else that bothers me: **NORTH- AND SOUTHAMERICA**.
Let's fix this!

<hr>

In [None]:
stats['Trade partner'] = stats['Trade partner'].str.replace()

Let's check for the unique values again to see if it worked! 

*Hint: use the `unique()` function*

<hr>

The numerical columns should be okay, but let's check the other string column, **'Type'** to see if we can improve it.

We'll use the <span style="color:blue">**unique( )**</span> function again.

In [None]:
stats['Type']

This looks strange! 'Import' and 'Imports' should be the same, just as 'Export' and 'Exports' should be the same.

Let's fix this.

<hr>

## String replacement, second time!

Above, we replaced ', TOTAL' with nothing (an empty string, `''`) by using the following code:

`stats['Trade partner'] = stats['Trade partner'].str.replace(', TOTAL', '')`
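
Here is the same technique as a small sketch on a made-up column, so you can see the before and after. The name `partners` and its values are only for illustration. We also pass `regex=False`, which tells pandas to treat the text literally rather than as a search pattern.

```python
import pandas as pd

# A hypothetical column with the same kind of problem as our data.
partners = pd.Series(["EUROPE, TOTAL", "ASIA, TOTAL", "EUROPE, TOTAL"])

# Replace ', TOTAL' with an empty string in every row.
cleaned = partners.str.replace(", TOTAL", "", regex=False)

print(list(cleaned.unique()))  # ['EUROPE', 'ASIA']
```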

Can you replace 'Imports' with 'Import'?
Check your work with the **unique( )** function when you're done!

In [None]:
stats['Type'] = stats['Type']

<hr>

## Use `str.replace()` to fix Export/Exports
Let's do the same thing with 'Exports'.

Remember to check using the `unique( )` function that you did everything correctly!

In [None]:
stats['Type'] = stats['Type']

<hr>

## Let's extract some data so we can make some charts!
So far, we have been working with ONE dataframe, called **stats**.

Let's make four more, each with a subset of the data from **stats** so it's easier to graph.

<hr>

## Using conditional logic

We'll start by selecting all rows and columns for which:
* the **'Trade partner'** is **'ASIA'**

AND

* the **'Type'** is **'Import'**

Here, we're using conditional logic, saying IF both of these requirements are TRUE for a row, put that row in a new dataframe.

Let's call this new dataframe **'Asia_import'**

In [None]:
Asia_import = stats[(stats['Trade partner']=='ASIA') & (stats['Type'] == 'Import')]
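
To see what is happening inside the square brackets, here is a sketch on a tiny made-up dataframe called `toy` (the rows and numbers are hypothetical). Each comparison gives one True/False answer per row, and `&` keeps only the rows where both answers are True.

```python
import pandas as pd

# A hypothetical dataframe with the same column names as our data.
toy = pd.DataFrame({
    "Trade partner": ["ASIA", "ASIA", "AFRICA"],
    "Type": ["Import", "Export", "Import"],
    "Amount": [100, 40, 25],
})

is_asia = toy["Trade partner"] == "ASIA"  # one True/False per row
is_import = toy["Type"] == "Import"       # one True/False per row
both = is_asia & is_import                # True only where both are True

print(both.tolist())  # [True, False, False]
print(toy[both])      # keeps only the rows marked True
```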

<hr>

## Use `head()` to inspect your new dataframe
Let's use the <span style="color:blue">head( )</span> function to look at the first few lines of our new dataframe.

In [None]:
Asia_import

<hr>

Looks like it worked!! 

Let's do the same for exports to Asia. We want all rows and columns for which:

The **'Trade partner'** is **'ASIA'** *AND* the **'Type'** is **'Export'**

Let's call this new dataframe **'Asia_export'**. Remember to use `head()` to check if your dataframe is correct.

In [None]:
Asia_export = stats[(stats['COLUMN']=='VALUE') & (stats['COLUMN'] == 'VALUE')]


<hr>

## Use conditional logic to create new dataframes
Let's do the same for Africa.

If you get stuck, just copy the code from above and change **Asia** to **Africa**.

In [None]:
Africa_import = stats[()]
Africa_export = stats[()]

Let's check the first few lines of **Africa_export**.

<hr>

## Time to graph

Today, we will be making a line graph examining the **Amount** traded over time (**Year**).

Let's start by making a line graph showing the Amount imported from Asia between 2003-2018.

<hr>

## Import `Matplotlib`

First, we'll import the **Matplotlib** library and tell the Jupyter notebook to show our plots inline.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

## Now, let's make our first plot!

In this section, we won't be doing much typing. Most of the code will already be there and we'll just change a few specific parameters.

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

ax.plot(
    DATAFRAME['COLUMN'],
    DATAFRAME['COLUMN'],
    c='blue', label='Import',
    linewidth=3
)

plt.show()

Now, let's try using the **Asia_export** dataframe.

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

ax.plot(
    DATAFRAME['COLUMN'],
    DATAFRAME['COLUMN'],
    c='orange', label='Export',
    linewidth=3
)

plt.show()

Let's try putting them on the same plot. Just copy and paste them together:

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

## Paste the two ax.plot(...) calls from the two code blocks above here. Don't copy the "fig" and "ax" lines or "plt.show()" since we've included them in this cell.

plt.show()

<hr>

## Let's add some labels and a title to our chart.

In [None]:
# Copy the code from the box above and paste it right below this line. Do not copy 'plt.show()'
fig = plt.figure()
ax = fig.add_subplot()

ax.plot(
    Asia_import['Year'],
    Asia_import['Amount'],
    c='blue', label='Import',
    linewidth=3
)

ax.plot(
    Asia_export['Year'],
    Asia_export['Amount'],
    c='orange', label='Export',
    linewidth=3
)

# Now let's set the title and labels
ax.set_title('Asia imports and exports, 2003-2018')
ax.text(2010, 150000, 'Import')
ax.text(2015, 150000, 'Export')
   
plt.show()

<hr>

## Practice plotting with Africa_export and Africa_import

Let's do the same thing with the **Africa_export** and **Africa_import** dataframes. 

*Hint: Don't forget to change the parameters for `ax.text()` since the scale for the Africa plot is different from the Asia one.*

For more resources, check out Dataquest.io and Datacamp.com.