<img src="https://github.com/bitprj/DigitalHistory/blob/master/Week2-Introduction-to-Python-_-NumPy/assets/icons/bitproject.png?raw=1" width="200" align="left"> 
<img src="https://github.com/bitprj/DigitalHistory/blob/master/Week2-Introduction-to-Python-_-NumPy/assets/icons/data-science.jpg?raw=1" width="350" align="right">

# <h1 align="center">Data Science Review</h1>

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1w6_7LCUFQK9-UgDV7N_KDncbXWHMZru2">
</p>



## <h2 align="center">Let's Review What We've Learned So Far</h2>

- **Python**
  - Variables.
  - Data Structures.
  - A little about methods and libraries.
- **Pandas**
  - Loading ```.csv``` files.
  - Cleaning datasets (basics).
- **Matplotlib**
  - Simple plots
    - Bar
    - Scatter
  - Labeling and Marking.




# <h1 align="center">Python as our Main Programming Language</h1>



<p align="center">
<img src="https://github.com/bitprj/DigitalHistory/blob/master/Week2-Introduction-to-Python-_-NumPy/assets/python.png?raw=true" width=150></p>

## Types

**So what are types?**

Types are the categories that data can belong to. The type of a value determines which operations can be performed on it.



| Type | Example | Also Known As |
|------|---------|---------------|
| ```int``` |2020| Integers |
|```float```|3.142| Decimals | 
|```str```|"Bit University"| Strings|
|```bool``` | True or False | Booleans (1 or 0) |


There are also other types such as ```bytes```, which stores a sequence of raw byte values (the binary form of data such as text).
For example:

Prefixing the string literal ```'Hello'``` with ```b```, as in ```b'Hello'```, creates a ```bytes``` value instead of a ```str```.
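
Here is a minimal sketch of converting between the two types, assuming UTF-8 encoding (the variable names are only illustrative):

```python
# Convert a str to bytes and back, assuming UTF-8 encoding
text = 'Hello'
raw = text.encode('utf-8')      # str -> bytes
print(raw)                      # b'Hello'
print(type(raw))                # <class 'bytes'>

restored = raw.decode('utf-8')  # bytes -> str
print(restored == text)         # True
```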

As we mentioned above, types help us identify what kind of operations can be performed with or on those values. Let's look at an example below.

In [None]:
1+1

In the example above, we added two integers together, and the answer came out as ```2```, which is as expected. Let's see what happens if we try to add together a string and an integer.

In [None]:
'Bit University'+1




As you can see, this gives us an error. Between a string and a number, the only arithmetic operation that works is multiplication (```*```) by an ```int```, which repeats the string. Multiplying a string by a ```float``` raises an error too.


In [None]:
'Bit University'*3
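
If the goal is to join a string and a number rather than repeat the string, one common approach is to convert the number with ```str()``` first. A small sketch (the variable name ```joined``` is just for illustration):

```python
# Adding a str and an int raises a TypeError
try:
    'Bit University' + 1
except TypeError as error:
    print('TypeError:', error)

# Convert the number to a string first if you want to join them
joined = 'Bit University' + str(1)
print(joined)   # Bit University1
```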

### Examples of data types

Feel free to practice with data types below. In each of the following **cells** (a block of code is known as a cell), wrap the existing value in ```type()``` by cutting it and pasting it inside the parentheses. The first one has been done for you.

In [None]:
type('Software Engineering is ....')

In [None]:
2020

In [None]:
21.003

In [None]:
True

## Variables

Above we have seen what the basic data types in Python are, but what do we do with them? So far we have no way of storing them. When we ran ```1+1``` we got the answer, but we can't use it anywhere else. If I wanted to use the answer to ```1+1``` (which is 2, by the way) somewhere else, I would have to type it out again. That doesn't seem right.

The purpose of computer programs is to be able to use computed values in multiple places. Thus, the next topic we will cover is about **variables**. The good thing is that you can store every data type in its own variable *(Take it one step further and you're storing functions and classes, but that comes later).*

To store data as a variable, I simply declare what I want to name it, put a ```=``` next to it and write what I want to store.

***For example:***

In [None]:
x = 1+1

And there we have it: we just stored a value in a variable. Want to confirm? Just pass your variable to the ```print()``` function.

In [None]:
print(x)

Replace ```1+1``` with the following, one-by-one and print the result:
- ```'Bit Project'```
- ```2020.2```
- ```2**2```

and see what the results are

**Note:** 

The names you use when creating these labels need to follow a few rules:

    1. Names can not start with a number.
    2. There can be no spaces in the name, use _ instead.
    3. Can't use any of these symbols :'",<>/?|\()!@#$%^&*~-+
    4. Using lowercase names is best practice.
    5. Avoid using the characters 'l' (lowercase letter el), 'O' (uppercase letter oh), 
       or 'I' (uppercase letter eye) as single character variable names.
    6. Avoid using words that have special meaning in Python like "list" and "str"
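
Some of these rules can be checked by Python itself. The sketch below uses made-up candidate names; note that words like "list" and "str" would pass this check, because they are built-in names rather than reserved keywords, so rule 6 is still something you have to watch for yourself.

```python
import keyword

# Hypothetical candidate names to check against the rules above
candidates = ['tax_rate', '2nd_value', 'my value', 'total$', 'class']

results = {}
for name in candidates:
    # isidentifier() checks the character rules (no leading digit, no spaces or symbols)
    # iskeyword() catches reserved words such as "class" or "for"
    results[name] = name.isidentifier() and not keyword.iskeyword(name)

print(results)
```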


Using variable names can be a very useful way to keep track of different variables in Python. 

For example:

In [None]:
# Use object names to keep better track of what's going on in your code!
income = 1000

tax_rate = 0.2

taxes = income*tax_rate

In [None]:
# Show the result!
print(taxes)

## Data Structures

## List


Earlier, when discussing strings, we introduced the concept of a *sequence*. Lists are the most general kind of sequence in Python. Unlike strings, they are mutable, meaning the elements inside a list can be changed!

Lists are constructed with brackets [] and commas separating every element in the list.

Let's start with seeing how we can build a list.

In [None]:
# Assign a list to a variable named my_list
my_list = [1,2,3]

We just created a list of integers, but lists can actually hold elements of multiple data types. For example:

In [None]:
my_list = ['A string',23,100.232,'o']

Just like strings, the len() function will tell you how many items are in the sequence of the list.

In [None]:
len(my_list)

In [None]:
my_list = ['one','two','three',4,5]

In [None]:
# Grab element at index 0
my_list[0] 

In [None]:
# Grab index 1 and everything past it
my_list[1:]

In [None]:
# Grab everything UP TO index 3
my_list[:3]

We can also use + to concatenate lists, just like we did for strings.

In [None]:
my_list + ['new item']

Note: This doesn't actually change the original list!

In [None]:
my_list
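
To actually keep the new item, you can either reassign the result or change the list in place. A short sketch, using a separate list called ```demo_list``` (an illustrative name) so that ```my_list``` above stays as it is:

```python
demo_list = ['one', 'two', 'three', 4, 5]

# + builds a brand-new list, so demo_list only changes if we reassign it
demo_list = demo_list + ['new item']
print(demo_list)        # ['one', 'two', 'three', 4, 5, 'new item']

# append() changes the existing list in place (and returns None)
demo_list.append('another item')
print(len(demo_list))   # 7
```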

In [None]:
x = [1,2,3,4,5,6,7]


## Dictionaries



We've been learning about *sequences* in Python but now we're going to switch gears and learn about *mappings* in Python. If you're familiar with other languages you can think of dictionaries as hash tables. 

So what are mappings? Mappings are collections of objects that are stored by a *key*, unlike a sequence, which stores objects by their relative position. This is an important distinction: you look items up by key rather than by position. (Since Python 3.7, dictionaries do remember the order in which keys were inserted, but you still cannot index them by position.)

A Python dictionary consists of a key and then an associated value. That value can be almost any Python object.
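
As a quick sketch of what "stored by a key" means in practice (the names and ages here are made up for illustration):

```python
# Hypothetical example data
ages = {'Amy': 21, 'Daniel': 24}

# Values are looked up by key, not by position
print(ages['Daniel'])                # 24

# ages[0] would raise a KeyError because 0 is not one of the keys
print(0 in ages)                     # False

# get() returns a default instead of raising when the key is missing
print(ages.get('Kyle', 'unknown'))   # unknown
```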

### Constructing a Dictionary
Let's see how we can build dictionaries and better understand how they work.

In [None]:
# Make a dictionary with {} and : to signify a key and a value
my_dictionary = {'key1':'value1','key2':'value2'}

In [None]:
# Call values by their key
my_dictionary['key2']

It's important to note that dictionaries are very flexible in the data types they can hold. For example:

In [None]:
my_dictionary = {'key1':123,'key2':[12,23,33],'key3':['item0','item1','item2']}

In [None]:
# Let's call items from the dictionary
my_dictionary['key3']

In [None]:
# Can call an index on that value
my_dictionary['key3'][0]

In [None]:
# Can then even call methods on that value
my_dictionary['key3'][0].upper()

We can affect the values of a key as well. For instance:

In [None]:
my_dictionary['key1']

In [None]:
# Subtract 123 from the value
my_dictionary['key1'] = my_dictionary['key1'] - 123

In [None]:
#Check
my_dictionary['key1']

A quick note: Python has built-in shorthand for in-place subtraction or addition (or multiplication or division). We could have also used -= for the above statement (and +=, *= or /= for the other operations). For example:

In [None]:
# Set the object equal to itself minus 123 
my_dictionary['key1'] -= 123
my_dictionary['key1']

We can also create keys by assignment. For instance if we started off with an empty dictionary, we could continually add to it:

In [None]:
# Create a new dictionary
d = {}

In [None]:
# Create a new key through assignment
d['animal'] = 'Dog'

In [None]:
# Can do this with any object
d['answer'] = 42

In [None]:
#Show
d

There are other important data types in Python such as ```tuples``` and ```sets```, but we won't be covering them in this workshop.

## Comparison Operators 

As stated previously, comparison operators allow us to compare variables and output a Boolean value (True or False). 

These operators are the exact same as what you've seen in Math, so there's nothing new here.

First we'll present a table of the comparison operators and then work through some examples:

<h2> Table of Comparison Operators </h2><p>  In the table below, a=9 and b=11.</p>

<table class="table table-bordered">
<tr>
<th style="width:10%">Operator</th><th style="width:45%">Description</th><th>Example</th>
</tr>
<tr>
<td>==</td>
<td>If the values of two operands are equal, then the condition becomes true.</td>
<td> (a == b) is not true.</td>
</tr>
<tr>
<td>!=</td>
<td>If values of two operands are not equal, then condition becomes true.</td>
<td>(a != b) is true</td>
</tr>
<tr>
<td>&gt;</td>
<td>If the value of left operand is greater than the value of right operand, then condition becomes true.</td>
<td> (a &gt; b) is not true.</td>
</tr>
<tr>
<td>&lt;</td>
<td>If the value of left operand is less than the value of right operand, then condition becomes true.</td>
<td> (a &lt; b) is true.</td>
</tr>
<tr>
<td>&gt;=</td>
<td>If the value of left operand is greater than or equal to the value of right operand, then condition becomes true.</td>
<td> (a &gt;= b) is not true. </td>
</tr>
<tr>
<td>&lt;=</td>
<td>If the value of left operand is less than or equal to the value of right operand, then condition becomes true.</td>
<td> (a &lt;= b) is true. </td>
</tr>
</table>

Let's now work through quick examples of each of these.

#### Equal

In [None]:
4 == 4

In [None]:
1 == 0
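
The remaining operators work the same way. Here is a short sketch that runs each row of the table, using the same values ```a=9``` and ```b=11```:

```python
a = 9
b = 11

print(a == b)   # False
print(a != b)   # True
print(a > b)    # False
print(a < b)    # True
print(a >= b)   # False
print(a <= b)   # True
```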

# Functions

Functions are blocks of code that perform some specific task and will run when they are called.

For example, 


In [None]:
def add_numbers(a, b):
    answer = a + b
    return answer

This is an example of a function that will return the sum of two numbers when it is called.

Let's see a case of running this example:

In [None]:
add_numbers(4, 6)

As you can see, when we call the `add_numbers()` function with two values as arguments, it runs the code within the function and returns the output. (We avoid naming it `sum` because that would hide Python's built-in `sum()` function.)

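You can write your own functions to perform any task you want. Python also ships with built-in functions such as ```print()``` and ```len()```, and libraries such as Pandas and Matplotlib provide many ready-made functions of their own.

We will work with these ready-made functions throughout our Data Science pipeline.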

**Other Things To Review:**
- If-else statements
- Loops
- Modules and Packages
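
As a quick refresher, the sketch below touches all three topics at once. The list of ages is made up, and ```math``` is just one example of a module from Python's standard library.

```python
import math  # a module from Python's standard library

# Hypothetical ages to classify
ages = [19, 24, 35]
labels = []

for age in ages:              # the loop visits every element
    if age < 21:              # if-else picks one branch per element
        labels.append('under 21')
    else:
        labels.append('21 or over')

print(labels)                 # ['under 21', '21 or over', '21 or over']
print(math.sqrt(16))          # 4.0
```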


## <h1 align="center">Pandas for Data Manipulation</h1>

<p align="center">
<img src="https://github.com/bitprj/DigitalHistory/blob/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/pandas.png?raw=true" width=200></p>


### What is Pandas?
This week, we will cover the basic data manipulation using Pandas.
1. Pandas is an open source data analysis and manipulation tool and it is widely used both in academia and industry.
2. It is built on top of the Python programming language. 
3. It offers data structures and operations for manipulating numerical tables and time series.

### Data Frames in Pandas
1. A DataFrame is a 2-dimensional labelled data structure with rows and columns, much like a spreadsheet: it represents tabular data.

For this workshop we will focus on DataFrames.

Let's see how Dataframes work in Pandas.

Let's say we have 7 lists of equal length 10.
List #:
1. Human id
2. First Name
3. Last Name
4. Age
5. Major
6. Gender
7. Number of Cars they own


*Disclaimer : All the data used for this section is fictional, i.e I made it up*

In [25]:
human_id = [42000500,560000001,210004342,9913124,20141252,9412414,661245,11,1245124,10]
f_name = ['Shayan','Amy','Daniel','Atul','Narae','Kyle','Shreya','Mohammad','Romaiza','Lionel']
l_name = ['Riyaz','Cu','Kim','Jayaram','Lee','Begovich','Gupta','Salah','Ibad','Messi']
Age =    [34,21,24,20,None,24,19,27,21,35]
Major = ['Dance','Computer Science','Electrical Engineering','Computer Science','Computer Science','Marketing','Business Studies','Sportsman','Political Science','Athletics']
Gender = ['Male','Female','Male','Male','Female','Male','Male',None,'Female','Male']
number_of_cars = [None,1,2,0,1,3,2,None,1,6]


In [26]:
lists = [human_id, f_name, l_name, Age, Major, Gender, number_of_cars]
# The lengths are all equal only if the set of lengths has exactly one element
if len({len(lst) for lst in lists}) == 1:
    print('Equal Length')
else:
    print('Invalid Lengths')

Equal Length


### Importing Libraries

In [27]:
import pandas as pd

### Storing our lists into a dictionary

In [28]:
data = {
    "Human ID": human_id,
    "First Name" : f_name,
    "Last Name" : l_name,
    "Age" :    Age,
    "Major" : Major,
    "Gender" : Gender,
    "Number of Cars" : number_of_cars
    }

### Converting our dictionary into a dataframe

In [29]:
data_frame = pd.DataFrame(data)
data_frame

Unnamed: 0,Human ID,First Name,Last Name,Age,Major,Gender,Number of Cars
0,42000500,Shayan,Riyaz,34.0,Dance,Male,
1,560000001,Amy,Cu,21.0,Computer Science,Female,1.0
2,210004342,Daniel,Kim,24.0,Electrical Engineering,Male,2.0
3,9913124,Atul,Jayaram,20.0,Computer Science,Male,0.0
4,20141252,Narae,Lee,,Computer Science,Female,1.0
5,9412414,Kyle,Begovich,24.0,Marketing,Male,3.0
6,661245,Shreya,Gupta,19.0,Business Studies,Male,2.0
7,11,Mohammad,Salah,27.0,Sportsman,,
8,1245124,Romaiza,Ibad,21.0,Political Science,Female,1.0
9,10,Lionel,Messi,35.0,Athletics,Male,6.0


### Changing the Index 

In [30]:
data_frame.set_index(keys='Human ID',inplace = True)
data_frame

Human ID,First Name,Last Name,Age,Major,Gender,Number of Cars
42000500,Shayan,Riyaz,34.0,Dance,Male,
560000001,Amy,Cu,21.0,Computer Science,Female,1.0
210004342,Daniel,Kim,24.0,Electrical Engineering,Male,2.0
9913124,Atul,Jayaram,20.0,Computer Science,Male,0.0
20141252,Narae,Lee,,Computer Science,Female,1.0
9412414,Kyle,Begovich,24.0,Marketing,Male,3.0
661245,Shreya,Gupta,19.0,Business Studies,Male,2.0
11,Mohammad,Salah,27.0,Sportsman,,
1245124,Romaiza,Ibad,21.0,Political Science,Female,1.0
10,Lionel,Messi,35.0,Athletics,Male,6.0


### Printing the first 5 rows of our dataframe

In [31]:
data_frame.head()

Human ID,First Name,Last Name,Age,Major,Gender,Number of Cars
42000500,Shayan,Riyaz,34.0,Dance,Male,
560000001,Amy,Cu,21.0,Computer Science,Female,1.0
210004342,Daniel,Kim,24.0,Electrical Engineering,Male,2.0
9913124,Atul,Jayaram,20.0,Computer Science,Male,0.0
20141252,Narae,Lee,,Computer Science,Female,1.0


### Identifying basic properties of our dataframe


In [32]:
data_frame.describe()

Unnamed: 0,Age,Number of Cars
count,9.0,8.0
mean,25.0,2.0
std,5.91608,1.85164
min,19.0,0.0
25%,21.0,1.0
50%,24.0,1.5
75%,27.0,2.25
max,35.0,6.0


### Data Frame information

In [33]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 42000500 to 10
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   First Name      10 non-null     object 
 1   Last Name       10 non-null     object 
 2   Age             9 non-null      float64
 3   Major           10 non-null     object 
 4   Gender          9 non-null      object 
 5   Number of Cars  8 non-null      float64
dtypes: float64(2), object(4)
memory usage: 560.0+ bytes


### Shape of our dataframe

In [34]:
data_frame.shape

(10, 6)

### Printing Columns in our dataframe

In [35]:
#print column names
data_frame.columns

Index(['First Name', 'Last Name', 'Age', 'Major', 'Gender', 'Number of Cars'], dtype='object')

### Resetting our dataframe index

In [36]:
data_frame.reset_index(inplace=True)

In [37]:
data_frame

Unnamed: 0,Human ID,First Name,Last Name,Age,Major,Gender,Number of Cars
0,42000500,Shayan,Riyaz,34.0,Dance,Male,
1,560000001,Amy,Cu,21.0,Computer Science,Female,1.0
2,210004342,Daniel,Kim,24.0,Electrical Engineering,Male,2.0
3,9913124,Atul,Jayaram,20.0,Computer Science,Male,0.0
4,20141252,Narae,Lee,,Computer Science,Female,1.0
5,9412414,Kyle,Begovich,24.0,Marketing,Male,3.0
6,661245,Shreya,Gupta,19.0,Business Studies,Male,2.0
7,11,Mohammad,Salah,27.0,Sportsman,,
8,1245124,Romaiza,Ibad,21.0,Political Science,Female,1.0
9,10,Lionel,Messi,35.0,Athletics,Male,6.0


### Dropping a column

In [38]:
data_frame.drop(columns = ['Human ID'],inplace = True)
data_frame

Unnamed: 0,First Name,Last Name,Age,Major,Gender,Number of Cars
0,Shayan,Riyaz,34.0,Dance,Male,
1,Amy,Cu,21.0,Computer Science,Female,1.0
2,Daniel,Kim,24.0,Electrical Engineering,Male,2.0
3,Atul,Jayaram,20.0,Computer Science,Male,0.0
4,Narae,Lee,,Computer Science,Female,1.0
5,Kyle,Begovich,24.0,Marketing,Male,3.0
6,Shreya,Gupta,19.0,Business Studies,Male,2.0
7,Mohammad,Salah,27.0,Sportsman,,
8,Romaiza,Ibad,21.0,Political Science,Female,1.0
9,Lionel,Messi,35.0,Athletics,Male,6.0


### Adding a new column

In [39]:
Joined_bit_project = ['2020-3-4','2020-4-5','2019-5-6','2019-12-27','2020-5-16','2020-4-20','2019-9-3','2020-10-11','2020-10-10','2018-5-1']

data_frame['Joined_bit_project'] = Joined_bit_project


In [40]:
data_frame

Unnamed: 0,First Name,Last Name,Age,Major,Gender,Number of Cars,Joined_bit_project
0,Shayan,Riyaz,34.0,Dance,Male,,2020-3-4
1,Amy,Cu,21.0,Computer Science,Female,1.0,2020-4-5
2,Daniel,Kim,24.0,Electrical Engineering,Male,2.0,2019-5-6
3,Atul,Jayaram,20.0,Computer Science,Male,0.0,2019-12-27
4,Narae,Lee,,Computer Science,Female,1.0,2020-5-16
5,Kyle,Begovich,24.0,Marketing,Male,3.0,2020-4-20
6,Shreya,Gupta,19.0,Business Studies,Male,2.0,2019-9-3
7,Mohammad,Salah,27.0,Sportsman,,,2020-10-11
8,Romaiza,Ibad,21.0,Political Science,Female,1.0,2020-10-10
9,Lionel,Messi,35.0,Athletics,Male,6.0,2018-5-1


In [41]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   First Name          10 non-null     object 
 1   Last Name           10 non-null     object 
 2   Age                 9 non-null      float64
 3   Major               10 non-null     object 
 4   Gender              9 non-null      object 
 5   Number of Cars      8 non-null      float64
 6   Joined_bit_project  10 non-null     object 
dtypes: float64(2), object(5)
memory usage: 688.0+ bytes


### Converting a column to datetime

In [42]:
data_frame['Joined_bit_project'] = pd.to_datetime(data_frame['Joined_bit_project'])
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   First Name          10 non-null     object        
 1   Last Name           10 non-null     object        
 2   Age                 9 non-null      float64       
 3   Major               10 non-null     object        
 4   Gender              9 non-null      object        
 5   Number of Cars      8 non-null      float64       
 6   Joined_bit_project  10 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 688.0+ bytes


In [43]:
data_frame

Unnamed: 0,First Name,Last Name,Age,Major,Gender,Number of Cars,Joined_bit_project
0,Shayan,Riyaz,34.0,Dance,Male,,2020-03-04
1,Amy,Cu,21.0,Computer Science,Female,1.0,2020-04-05
2,Daniel,Kim,24.0,Electrical Engineering,Male,2.0,2019-05-06
3,Atul,Jayaram,20.0,Computer Science,Male,0.0,2019-12-27
4,Narae,Lee,,Computer Science,Female,1.0,2020-05-16
5,Kyle,Begovich,24.0,Marketing,Male,3.0,2020-04-20
6,Shreya,Gupta,19.0,Business Studies,Male,2.0,2019-09-03
7,Mohammad,Salah,27.0,Sportsman,,,2020-10-11
8,Romaiza,Ibad,21.0,Political Science,Female,1.0,2020-10-10
9,Lionel,Messi,35.0,Athletics,Male,6.0,2018-05-01
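
One reason to convert is that datetime columns expose their parts through the ```.dt``` accessor. A minimal sketch on a small stand-in frame (```demo``` and ```year_joined``` are illustrative names, and the three dates are taken from the column above):

```python
import pandas as pd

# Small stand-in frame with three of the join dates from above
demo = pd.DataFrame({'Joined_bit_project': ['2020-03-04', '2019-12-27', '2018-05-01']})
demo['Joined_bit_project'] = pd.to_datetime(demo['Joined_bit_project'])

# The .dt accessor exposes date parts of a datetime column
demo['year_joined'] = demo['Joined_bit_project'].dt.year
print(demo)
```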


### Dropping Null Values

In [44]:
data_frame.dropna(inplace = True)
data_frame

Unnamed: 0,First Name,Last Name,Age,Major,Gender,Number of Cars,Joined_bit_project
1,Amy,Cu,21.0,Computer Science,Female,1.0,2020-04-05
2,Daniel,Kim,24.0,Electrical Engineering,Male,2.0,2019-05-06
3,Atul,Jayaram,20.0,Computer Science,Male,0.0,2019-12-27
5,Kyle,Begovich,24.0,Marketing,Male,3.0,2020-04-20
6,Shreya,Gupta,19.0,Business Studies,Male,2.0,2019-09-03
8,Romaiza,Ibad,21.0,Political Science,Female,1.0,2020-10-10
9,Lionel,Messi,35.0,Athletics,Male,6.0,2018-05-01
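
Dropping every row with any missing value cost us three of our ten rows. If only some columns matter for your analysis, the ```subset``` parameter of ```dropna()``` limits the check to those columns. A sketch on a small stand-in frame (the name ```demo``` is illustrative, the values come from the table above):

```python
import pandas as pd

# Small stand-in frame with gaps in two different columns
demo = pd.DataFrame({
    'First Name': ['Shayan', 'Amy', 'Narae'],
    'Age': [34, 21, None],
    'Number of Cars': [None, 1, 1],
})

# dropna() with no arguments removes every row that has any missing value
print(len(demo.dropna()))                  # 1

# subset= limits the check to the listed columns
print(len(demo.dropna(subset=['Age'])))    # 2
```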


In [45]:
data_frame.reset_index(inplace = True)

### Deleting a Column

In [None]:
del data_frame['index']

In [None]:
data_frame
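
An alternative to deleting the leftover ```index``` column afterwards is to never create it: ```reset_index()``` accepts ```drop=True```, which discards the old index. A minimal sketch on a stand-in frame (```demo``` is an illustrative name):

```python
import pandas as pd

# Stand-in frame whose index has gaps, as ours does after dropna()
demo = pd.DataFrame({'Age': [21, 24]}, index=[1, 5])

# drop=True discards the old index instead of saving it as an "index" column
demo = demo.reset_index(drop=True)
print(demo.columns.tolist())   # ['Age']
print(demo.index.tolist())     # [0, 1]
```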

### Sorting a Column

In [46]:
data_frame.sort_values(by = 'Age',inplace = True)

In [47]:
data_frame

Unnamed: 0,index,First Name,Last Name,Age,Major,Gender,Number of Cars,Joined_bit_project
4,6,Shreya,Gupta,19.0,Business Studies,Male,2.0,2019-09-03
2,3,Atul,Jayaram,20.0,Computer Science,Male,0.0,2019-12-27
0,1,Amy,Cu,21.0,Computer Science,Female,1.0,2020-04-05
5,8,Romaiza,Ibad,21.0,Political Science,Female,1.0,2020-10-10
1,2,Daniel,Kim,24.0,Electrical Engineering,Male,2.0,2019-05-06
3,5,Kyle,Begovich,24.0,Marketing,Male,3.0,2020-04-20
6,9,Lionel,Messi,35.0,Athletics,Male,6.0,2018-05-01
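
Notice that several people share the same age. ```sort_values()``` also accepts a list of columns, plus a matching list for ```ascending```, so ties can be broken by a second column. A sketch on a stand-in frame (```demo``` and ```ordered``` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({
    'First Name': ['Shreya', 'Romaiza', 'Amy'],
    'Age': [19, 21, 21],
})

# Oldest first; ties on Age are broken alphabetically by First Name
ordered = demo.sort_values(by=['Age', 'First Name'], ascending=[False, True])
print(ordered)
```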


In [48]:
data_frame.describe()

Unnamed: 0,index,Age,Number of Cars
count,7.0,7.0,7.0
mean,4.857143,23.428571,2.142857
std,3.023716,5.442338,1.9518
min,1.0,19.0,0.0
25%,2.5,20.5,1.0
50%,5.0,21.0,2.0
75%,7.0,24.0,2.5
max,9.0,35.0,6.0


# <h1 align="center">Matplotlib and Pandas Visualization</h1>

<p align="center">
<img src="https://github.com/bitprj/DigitalHistory/blob/master/Week4-Introduction-to-data-visualization-and-graphs-with-matplotlib/assets/matplotlib-logo.png?raw=true" width=400>
<img src="https://github.com/bitprj/DigitalHistory/blob/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/pandas.png?raw=true" width=200></p>






## What is Matplotlib

The ```matplotlib``` library is a Python 2-Dimensional plotting (x and y) library which allows us to generate:
- Line plots
- Scatter plots
- Histograms
- Barplots

In most plotting cases we use the ```pyplot``` module of ```matplotlib```.

## How does Pandas work for visualizations?
One advantage of Python is that libraries can build on one another. Pandas, for example, builds its plotting features on top of matplotlib. Although matplotlib is useful throughout many areas (research, business, engineering), in our use cases pandas works very well, because it allows us to plot a DataFrame directly.

In the exercise below, we'll be looking at a simple way of adding important features to plots of our DataFrame.

Our dataset for this exercise will be the stock prices of 9 listed companies from 1990 to 2012.

Although I am showing the pandas version, in the comments ```#``` I have shown the ```matplotlib.pyplot``` implementation.
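
The two styles can also be mixed, because ```df.plot()``` hands back a matplotlib ```Axes``` object. The sketch below uses a tiny stand-in frame with three rows from our dataset. The ```Agg``` backend line is an assumption for running it as a plain script without a display, and can be left out in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: off-screen backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for the stock DataFrame
demo = pd.DataFrame(
    {'AA': [4.98, 5.04, 5.07], 'MSFT': [0.51, 0.51, 0.51]},
    index=pd.to_datetime(['1990-02-01', '1990-02-02', '1990-02-05']),
)

# pandas draws the lines and returns the matplotlib Axes it drew on
ax = demo.plot(y=['AA', 'MSFT'], title='AA and MSFT')

# From here on we can use ordinary matplotlib calls on the same plot
ax.set_xlabel('Date')
ax.set_ylabel('Stock Market Price')
plt.close(ax.figure)
```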

### Load Libraries

In [49]:
import pandas as pd
import matplotlib.pyplot as plt

### Load DataFrame

In [51]:
# url = 'https://tinyurl.com/y64sugg4' # line 1
url = 'https://raw.githubusercontent.com/bitprj/BitUniversity/Amy_SDSU/Week4-Introduction-to-data-visualization-and-graphs-with-matplotlib/data/stock_px/stock_px.csv' 
df = pd.read_csv(url,index_col = 0,parse_dates = True ) # line 2
df

Unnamed: 0,AA,AAPL,GE,IBM,JNJ,MSFT,PEP,SPX,XOM
1990-02-01,4.98,7.86,2.87,16.79,4.27,0.51,6.04,328.79,6.12
1990-02-02,5.04,8.00,2.87,16.89,4.37,0.51,6.09,330.92,6.24
1990-02-05,5.07,8.18,2.87,17.32,4.34,0.51,6.05,331.85,6.25
1990-02-06,5.01,8.12,2.88,17.56,4.32,0.51,6.15,329.66,6.23
1990-02-07,5.04,7.77,2.91,17.93,4.38,0.51,6.17,333.75,6.33
...,...,...,...,...,...,...,...,...,...
2011-10-10,10.09,388.81,16.14,186.62,64.43,26.94,61.87,1194.89,76.28
2011-10-11,10.30,400.29,16.14,185.00,63.96,27.00,60.95,1195.54,76.27
2011-10-12,10.05,402.19,16.40,186.12,64.33,26.96,62.70,1207.25,77.16
2011-10-13,10.10,408.43,16.22,186.82,64.23,27.18,62.36,1203.66,76.37


In **line 1**, we are saving the URL of our dataset as a string in the variable ```url```.

In **line 2**, we are calling the pandas ```read_csv``` function and passing the ```url``` variable as an argument. We have also added two more parameters: ```index_col = 0```, which indicates that the first column (in Python, counting starts from 0) is our index, and ```parse_dates = True```, which I added because I already knew that the index column holds dates.
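
To see what those two parameters do without downloading anything, here is a sketch that reads two rows from an in-memory string instead of a URL. The exact layout of the real file (an unnamed first column holding the dates) is an assumption based on the table above.

```python
import io
import pandas as pd

# Two rows copied from the table above; the file layout is assumed
csv_text = """,AA,AAPL
1990-02-01,4.98,7.86
1990-02-02,5.04,8.00
"""

demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# With parse_dates=True the index holds real dates rather than plain strings
print(type(demo.index))
print(demo.loc[pd.Timestamp('1990-02-02'), 'AA'])
```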





### Plot 1 Column ```AA```

In [None]:
df.plot(y ='AA')



# plt.plot(df['AA'])
# plt.legend(['AA'])
# plt.show()

Above I haven't declared an ```x``` axis since it is already assumed to be the index (in this case the date).


### Plot 2 Columns [```AA```,```MSFT```]

In [None]:
df.plot(y= ['AA','MSFT'])

# plt.plot(df['AA'])
# plt.plot(df['MSFT'])
# plt.legend(['AA','MSFT'])
# plt.show()

### Adding Labels

In [None]:
df.plot(y= ['AA','MSFT'],
        xlabel = 'Date from 1990 - 2012',
        ylabel = 'Stock Market Price')



# plt.plot(df['AA'])
# plt.plot(df['MSFT'])
# plt.xlabel('Date from 1990 - 2012')
# plt.ylabel('Stock Market Price')
# plt.legend(['AA','MSFT'])
# plt.show()

### Adding a Title

In [None]:
df.plot(y= ['AA','MSFT'],
        xlabel = 'Date from 1990 - 2012',
        ylabel = 'Stock Market Price',
        title = 'Stock Market Index for AA and MSFT')

# plt.plot(df['AA'])
# plt.plot(df['MSFT'])
# plt.xlabel('Date from 1990 - 2012')
# plt.ylabel('Stock Market Price')
# plt.title('Stock Market Index for AA and MSFT')
# plt.legend(['AA','MSFT'])
# plt.show()

### Using alpha to make the plots more transparent

In [None]:
df.plot(y= ['AA','MSFT'],
        xlabel = 'Date from 1990 - 2012',
        ylabel = 'Stock Market Price',
        title = 'Stock Market Index for AA and MSFT',
        alpha = 0.8)

# plt.plot(df['AA'],alpha = 0.8)
# plt.plot(df['MSFT'],alpha =0.8)
# plt.xlabel('Date from 1990 - 2012')
# plt.ylabel('Stock Market Price')
# plt.title('Stock Market Index for AA and MSFT')
# plt.legend(['AA','MSFT'])
# plt.show()

### Changing the size of the figure

In [None]:
df.plot(y= ['AA','MSFT'],
        xlabel = 'Date from 1990 - 2012',
        ylabel = 'Stock Market Price',
        title = 'Stock Market Index for AA and MSFT',
        alpha = 0.8,
        figsize = (10,7))


# fig = plt.figure(figsize=(10,7))
# plt.plot(df['AA'],alpha = 0.8)
# plt.plot(df['MSFT'],alpha =0.8)
# plt.xlabel('Date from 1990 - 2012')
# plt.ylabel('Stock Market Price')
# plt.title('Stock Market Index for AA and MSFT')
# plt.legend(['AA','MSFT'])
# plt.show()

### Changing Colors and Adding markers

In [None]:
df.plot(y= ['AA','MSFT'],
        xlabel = 'Date from 1990 - 2012',
        ylabel = 'Stock Market Price',
        title = 'Stock Market Index for AA and MSFT',
        alpha = 0.8,
        figsize = (20,10),
        style=['ro','bx'])


# fig = plt.figure(figsize=(20,10))
# plt.plot(df['AA'],'ro',alpha = 0.8)
# plt.plot(df['MSFT'],'bx',alpha =0.8)
# plt.xlabel('Date from 1990 - 2012')
# plt.ylabel('Stock Market Price')
# plt.title('Stock Market Index for AA and MSFT')
# plt.legend(['AA','MSFT'])
# plt.show()



### Limiting our x-axis

In [None]:
import datetime

In [None]:
df.plot(y= ['AA','MSFT'],
        xlabel = 'Date from 1990 - 2012',
        ylabel = 'Stock Market Price',
        title = 'Stock Market Index for AA and MSFT',
        alpha = 0.8,
        figsize = (20,10),
        style=['ro','bx'],
        xlim = (datetime.date(2006,1,1),datetime.date(2010,1,1)))

# fig = plt.figure(figsize=(20,10))
# plt.plot(df['AA'],'ro',alpha = 0.8)
# plt.plot(df['MSFT'],'bx',alpha =0.8)
# plt.xlabel('Date from 1990 - 2012')
# plt.ylabel('Stock Market Price')
# plt.title('Stock Market Index for AA and MSFT')
# plt.legend(['AA','MSFT'])
# plt.xlim(datetime.date(2006,1,1),datetime.date(2010,1,1))
# plt.show()

### Plotting Histograms

In [None]:
df.hist('IBM',bins = 20)

# plt.hist(df['IBM'],bins=20)
# plt.show()

<p align="center">
<img src="https://i.pinimg.com/originals/89/d9/e0/89d9e0f67c361865fe9746c3c3de6b8a.gif" width="450">
</p>

# <div align="center">Analyzing the 16th Century Transatlantic Slave voyages</div>

<p align="center">
<img src="https://scx2.b-cdn.net/gfx/news/hires/2017/590c5b1f40cbd.gif" width="800">
</p>

## <h2 align="center">Loading and modifying the dataset</h2>

In [None]:
import matplotlib.pyplot as plt
import pandas as pd


In [None]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week5-Lab-Visualizing-the-Translatlantic-Slave-Trade/data/trans-atlantic-slave-trade/trans-atlantic-slave-trade.csv'

trans_atlc_trade = pd.read_csv(url)

In [None]:
trans_atlc_trade

### Important Facts About the Dataset

In [None]:
Unaccounted_trips = trans_atlc_trade['Slaves arrived at 1st port'].isna().sum()
print(f'The total number of unaccounted trips is:  {Unaccounted_trips}')


In [None]:
number_of_slaves_accounted = trans_atlc_trade['Slaves arrived at 1st port'].sum()
print(f'The total number of slaves accounted for are: {number_of_slaves_accounted}')

Historical estimates suggest that about 12.5 million slaves were traded in total. This means that according to this dataset:
- Around ```7436701``` slaves are not accounted for. *(12500000 - 5063299)*



In [None]:
num_of_unaccounted_slaves = 12_500_000 - number_of_slaves_accounted
print(num_of_unaccounted_slaves)

### Changing column names






In [None]:
#print column names
trans_atlc_trade.columns

Next we create a **Python** dictionary in which we will write the names of the new columns. As you can see below, the keys are the names of the original columns and the values are the new names. Let's look at why.



In [None]:
new_col_names ={"Voyage ID": 'voyage_id',
                "Vessel name": 'vessel_name',
                "Voyage itinerary imputed port where began (ptdepimp) place": 'voyage_started',
                "Voyage itinerary imputed principal place of slave purchase (mjbyptimp) ": 'voyage_pit_stop',
                "Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place": 'end_port',
                "Year of arrival at port of disembarkation":'year_of_arrival',
                "Slaves arrived at 1st port":'slaves_onboard',
                "Captain's name" : 'captain_names'
            }

In [None]:
trans_atlc_trade = trans_atlc_trade.rename(columns=new_col_names)

trans_atlc_trade
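
One thing to watch for: ```rename()``` only changes columns whose current name matches a key exactly, and by default it silently ignores keys that match nothing. A stray space or typo in a key therefore leaves that column unrenamed without any error. A sketch on a stand-in frame (the values and the ```'Not a column'``` key are made up):

```python
import pandas as pd

demo = pd.DataFrame({'Voyage ID': [1, 2], 'Vessel name': ['A', 'B'], 'Other': [0, 0]})

# Keys are the current names, values are the replacements
demo = demo.rename(columns={'Voyage ID': 'voyage_id',
                            'Vessel name': 'vessel_name',
                            'Not a column': 'ignored'})   # no match, no error

print(demo.columns.tolist())   # ['voyage_id', 'vessel_name', 'Other']
```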

### Moving Column Positions - ```trans_atlc_trade.reindex()```

In [None]:
column_names = ['voyage_id',"year_of_arrival","vessel_name", "voyage_started","voyage_pit_stop", "end_port","slaves_onboard"]

trans_atlc_trade = trans_atlc_trade.reindex(columns=column_names)
trans_atlc_trade
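
Besides reordering, ```reindex(columns=...)``` keeps only the columns you list (which is why ```captain_names``` is gone above), and a listed name that does not exist yet becomes a column of missing values. A sketch on a stand-in frame with made-up values:

```python
import pandas as pd

demo = pd.DataFrame({'voyage_id': [1, 2],
                     'vessel_name': ['A', 'B'],
                     'captain_names': ['X', 'Y']})

# Only the listed columns survive, in the listed order
demo = demo.reindex(columns=['vessel_name', 'voyage_id', 'end_port'])
print(demo.columns.tolist())           # ['vessel_name', 'voyage_id', 'end_port']

# A label that did not exist before becomes a column of missing values
print(demo['end_port'].isna().all())   # True
```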

**Questions**

**Is Voyage ID a good index and do we need it as a column?**

No, but we need an index.

**Can 'year_of_arrival' be an Index?**

No, because years repeat in the data. Therefore we need a simple running counter.
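
You can check whether a column would make a unique index with the ```is_unique``` attribute. A sketch on a stand-in frame (the IDs and years are made up for illustration):

```python
import pandas as pd

# Hypothetical values: two voyages arrive in the same year
demo = pd.DataFrame({'voyage_id': [101, 102, 103],
                     'year_of_arrival': [1520, 1520, 1526]})

print(demo['voyage_id'].is_unique)         # True
print(demo['year_of_arrival'].is_unique)   # False
```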



### Remove Voyage ID -```trans_atlc_trade.drop()```

We now have a new index from 0 to 15299.

Do we need ```voyage_id```? I don't think so: every Voyage ID is unique, so it doesn't help us find anything useful.

Next, drop this column.

In [None]:
trans_atlc_trade = trans_atlc_trade.drop(columns='voyage_id')
trans_atlc_trade

### Using ```dropna()```

For this data set we will be working with trips that were completely accounted for in all of the remaining features.

In [None]:
trans_atlc_trade = trans_atlc_trade.dropna()
trans_atlc_trade

In [None]:
trans_atlc_trade.info()

### Changing Column Type and Sorting - ```trans_atlc_trade.sort_values()```

In [None]:
trans_atlc_trade = trans_atlc_trade.sort_values(by='year_of_arrival', ascending=True)
trans_atlc_trade

### Resetting the Index
Resetting the index again, now that the rows are sorted by ```year_of_arrival``` in ascending order.

In [None]:
trans_atlc_trade.reset_index(inplace=True, drop=True)
trans_atlc_trade

### Finding Unique and similar strings 
```trans_atlc_trade['column_name'].unique()``` and the resulting array's ```.sort()``` method

In [None]:
a = trans_atlc_trade['voyage_started'].unique()
a.sort()
a

In [None]:
a = trans_atlc_trade['voyage_pit_stop'].unique()
a.sort()
a

In [None]:
a = trans_atlc_trade['end_port'].unique()
a.sort()
a
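Sorting the unique values is useful because near-duplicate spellings end up next to each other, which makes them easy to spot. A small sketch with made-up entries; it uses Python's built-in ```sorted()``` rather than the in-place ```.sort()``` so that it does not depend on the exact array type that ```unique()``` returns.

```python
import pandas as pd

# Hypothetical column values; two of them describe the same place
ports = pd.Series(["Lisbon", "Virginia, port unspecified", "Lisbon", "Virginia"])

names = sorted(ports.unique())   # distinct values, in alphabetical order
print(names)   # ['Lisbon', 'Virginia', 'Virginia, port unspecified']
```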

### Working with Strings - ```trans_atlc_trade['column_name'].str.replace()```

To replace unwanted parts of a string, we use ```trans_atlc_trade['column_name'].str.replace('string to find', 'string to replace it with')```. This method looks for the string we specify in every entry of the column and replaces it with what we want.

For example:
Suppose an entry in the 'voyage_started' column is 'Virginia, port unspecified'. Running the command
```trans_atlc_trade['voyage_started'].str.replace(', port unspecified', '')```
changes the string from 'Virginia, port unspecified' to 'Virginia'.

In [None]:
# regex=False makes pandas treat every pattern as plain text,
# so '.' and '(' are not read as regular-expression symbols.
trans_atlc_trade['voyage_started'] = trans_atlc_trade['voyage_started'].str.replace(', port unspecified', '', regex=False)
trans_atlc_trade['voyage_started'] = trans_atlc_trade['voyage_started'].str.replace(', colony unspecified', '', regex=False)
trans_atlc_trade['voyage_started'] = trans_atlc_trade['voyage_started'].str.replace('.', '', regex=False)

trans_atlc_trade['voyage_pit_stop'] = trans_atlc_trade['voyage_pit_stop'].str.replace(', port unspecified', '', regex=False)
trans_atlc_trade['voyage_pit_stop'] = trans_atlc_trade['voyage_pit_stop'].str.replace(', colony unspecified', '', regex=False)
trans_atlc_trade['voyage_pit_stop'] = trans_atlc_trade['voyage_pit_stop'].str.replace('.', '', regex=False)

trans_atlc_trade['end_port'] = trans_atlc_trade['end_port'].str.replace(', port unspecified', '', regex=False)
trans_atlc_trade['end_port'] = trans_atlc_trade['end_port'].str.replace(', colony unspecified', '', regex=False)
trans_atlc_trade['end_port'] = trans_atlc_trade['end_port'].str.replace('.', '', regex=False)
trans_atlc_trade['end_port'] = trans_atlc_trade['end_port'].str.replace(' (colony unspecified)', '', regex=False)
trans_atlc_trade['end_port'] = trans_atlc_trade['end_port'].str.replace(', unspecified', '', regex=False)
trans_atlc_trade['end_port'] = trans_atlc_trade['end_port'].str.replace(',unspecified', '', regex=False)

trans_atlc_trade
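One thing to watch for with ```str.replace()```: depending on the pandas version, the pattern may be read as a regular expression, where ```.``` and parentheses have a special meaning. A minimal sketch, using made-up entries (only 'Virginia, port unspecified' comes from the example above), that passes ```regex=False``` so every pattern is matched as plain text:

```python
import pandas as pd

# Hypothetical entries covering the three kinds of clean-up
ports = pd.Series([
    "Virginia, port unspecified",
    "St. Kitts",
    "Brazil (colony unspecified)",
])

# Calls can be chained; each one returns a new Series
cleaned = (
    ports.str.replace(", port unspecified", "", regex=False)
         .str.replace(" (colony unspecified)", "", regex=False)
         .str.replace(".", "", regex=False)
)

print(cleaned.tolist())   # ['Virginia', 'St Kitts', 'Brazil']
```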

In [None]:
trans_atlc_trade.dtypes

## <h2 align="center">Micro Wrangling and Visualization</h2>

We will start this part by dividing our dataset into multiple smaller dataframes. The approach we will take is to separate the rows based on the ```year_of_arrival``` column.

For example, the data can be split into 4 intervals; the blocks below work through the first one:
- ```1500 to 1600```
- ```1601 to 1700```
- ```1701 to 1800```
- ```1801 to 1900```
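The four intervals listed above could also be produced in one loop. This is only a sketch on made-up rows: the names ```intervals``` and ```by_interval``` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data with one trip per century
toy = pd.DataFrame({
    "vessel_name": ["Ship A", "Ship B", "Ship C", "Ship D"],
    "year_of_arrival": [1550, 1650, 1750, 1850],
})

intervals = [(1500, 1600), (1601, 1700), (1701, 1800), (1801, 1900)]

by_interval = {}
for start, end in intervals:
    # between() includes both end points by default
    in_range = toy["year_of_arrival"].between(start, end)
    by_interval[(start, end)] = toy[in_range]

for (start, end), frame in by_interval.items():
    print(start, end, len(frame))
```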



## Between 1500 and 1600

In [None]:
dataset_between15_16 = trans_atlc_trade.where((trans_atlc_trade['year_of_arrival'] >= 1500) & (trans_atlc_trade['year_of_arrival'] <= 1600))
dataset_between15_16

In [None]:
dataset_between15_16 = dataset_between15_16.dropna()
dataset_between15_16
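The two cells above first blank out the rows outside the range with ```where()``` and then remove them with ```dropna()```. An alternative, shown here as a sketch on made-up rows, is to select the matching rows directly with a boolean mask. Both give the same rows, but ```where()``` turns whole-number columns into decimals because it has to store missing values along the way.

```python
import pandas as pd

# Hypothetical toy data; only Ship A and Ship C fall in the range
toy = pd.DataFrame({
    "vessel_name": ["Ship A", "Ship B", "Ship C"],
    "year_of_arrival": [1550, 1650, 1590],
})

in_range = (toy["year_of_arrival"] >= 1500) & (toy["year_of_arrival"] <= 1600)

masked = toy.where(in_range).dropna()   # two steps: blank out, then drop
filtered = toy[in_range]                # one step: keep matching rows only

print(list(masked.index), list(filtered.index))   # [0, 2] [0, 2]
print(masked["year_of_arrival"].dtype, filtered["year_of_arrival"].dtype)
```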

### Total Number of Slaves Transported between 1500-1600 - Complete Records


In [None]:
dataset_between15_16.slaves_onboard.sum()

### Visualizing Trips During 1500-1600

### Pandas Plots

In [None]:
fig = plt.figure(figsize = (50,20))
ax1 = fig.add_subplot(2,2,1)

dataset_between15_16.plot(x='vessel_name',
                         y = 'slaves_onboard',
                         kind = 'bar', ax = ax1,
                         rot = 90)

### Ships Carrying Fewer than 100 Slaves per Trip

In [None]:
temp_plot = dataset_between15_16.where(dataset_between15_16['slaves_onboard'] < 100.0).dropna()

temp_plot

In [None]:
temp_plot.plot(x='vessel_name',
                         y = 'slaves_onboard',kind = 'bar',
                         rot = 90)



### Ships Carrying More than 100 Slaves per Trip

In [None]:
temp_plot = dataset_between15_16.where(dataset_between15_16['slaves_onboard'] > 100.0).dropna()

temp_plot.plot(x='vessel_name',
                         y = 'slaves_onboard',kind = 'bar',
                         rot = 90,grid = True,figsize = (20,10))

temp_plot

In [None]:
dataset_between15_16.describe()

In [None]:
third_quartile_slaves_onboard = int(dataset_between15_16['slaves_onboard'].quantile(.75))
print(third_quartile_slaves_onboard)


# third_quartile_slaves_onboard = 202

### Ships Carrying More than 202 Slaves per Trip

In [None]:
temp_plot = dataset_between15_16.where(dataset_between15_16['slaves_onboard'] > third_quartile_slaves_onboard).dropna()
print(f'There are {temp_plot.shape[0]} trips that carried more than {third_quartile_slaves_onboard} slaves.')
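The third quartile is the value below which 75% of the entries fall. Here is a small sketch of how ```quantile(.75)``` behaves, using made-up numbers rather than real records:

```python
import pandas as pd

# Hypothetical counts, already in ascending order for readability
counts = pd.Series([100, 150, 200, 250, 300])

q3 = counts.quantile(0.75)     # third quartile
above = counts[counts > q3]    # entries strictly above the third quartile

print(q3)               # 250.0
print(above.tolist())   # [300]
```

Note that the comparison is strict, so an entry exactly equal to the third quartile is not included.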



In [None]:
temp_plot.plot(x='vessel_name',
                         y = 'slaves_onboard',kind = 'bar',
                         rot = 45,grid = True,figsize = (20,10))




### Check the most used 'voyage_started'

In [None]:
temp_plot['voyage_started'].hist(bins = 20,
                                 alpha = 0.5,
                                 xrot = 45,
                                 figsize = (10,10)
                                )

### Check the most used 'voyage_pit_stop'

In [None]:
temp_plot['voyage_pit_stop'].hist(bins=20, 
                           alpha=0.7,
                           xrot = 0,
                          figsize = (10,10))

### Check the most used 'end_port'

In [None]:
temp_plot['end_port'].hist(bins=20, 
                           alpha=0.7,
                           xrot = 0,
                          figsize = (10,10))
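The histograms above show how often each port appears. If we want the same information as numbers, one option is ```value_counts()```. A minimal sketch on made-up entries (the port names are placeholders for illustration):

```python
import pandas as pd

# Hypothetical column of port names
ports = pd.Series(["Lisbon", "Seville", "Lisbon", "Lisbon", "Seville", "Cadiz"])

counts = ports.value_counts()         # number of rows per distinct value, largest first
print(counts.idxmax(), counts.max())  # Lisbon 3
```

On the real data, the same call could be applied to ```temp_plot['end_port']```, and ```counts.plot(kind='bar')``` would draw the counts as a bar chart.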

### Observations 

The majority of the ships carrying more than 200 people started their journey in Portugal, made a pit stop in Africa, and went to Cartagena, Colombia.

# <h1 align="center">Thank you!</h1> 

Additional Resources

- Python Documentation:
  - https://www.pythoncheatsheet.org/
  - https://docs.python.org/3/

- Pandas Introduction and Documentation:
  - https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
  - https://medium.com/datadriveninvestor/python-pandas-library-for-beginners-a-simplified-guide-for-getting-started-and-ditching-20992b7cd4da

- Matplotlib Introduction and Documentation:
  - https://matplotlib.org/
  - https://realpython.com/python-matplotlib-guide/
  - https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html 



