# Lab 2: Laboratory Notes - Week 2: Data Sources

Start here if you are already familiar with Python (if not, please do last week's homework).

## Python Libraries and Useful Data Science Libraries

We have already seen some built-in functions, such as <span style="color:red">type()</span>.  We usually write the parentheses (round brackets) to show that we are referring to a function, as opposed to a variable name or an attribute.  Most programming languages come with far more functions than any single program needs.  Hence, functions that are useful for a specific domain are packaged together as libraries. A library is therefore a collection of prewritten code that programmers can reuse for specific tasks in their application domain.

### Python Libraries for Data Science

Specific libraries that are considered as the “starter pack” for Data Science:

* **NumPy**: Firstly, it is pronounced "numb-pie" and not "num-pee".  It is used for scientific computing, with support for multi-dimensional arrays (you will hopefully have seen arrays in your earlier courses!).  
* **Pandas**: The cute black and white Asian bear!  The name is actually a shortening of "Panel Data".  It contains the data structures as well as operations for manipulating numerical tables.  Do note that you can have non-numerical data in the tables, much like an MS-Excel sheet.  
* **Matplotlib**: Library for visualisation.  Its plotting interface was modelled on Matlab (a programming and numerical computing platform used for analysing data, developing algorithms, and creating models) and hence its output has a similar scientific look.  Another popular package is the Python <span style="color:red">seaborn</span> package, which is built on top of Matplotlib.  These are suitable for scientific output, but in general, for commercial data science projects, the data is visualised using other software tools such as MS-PowerBI, Tableau, SAS Visualiser, and Qlik.  
* **Scikit-learn**: A Python machine learning library that provides the tools for data mining and data analysis.

For some, you may also want to look at:

* **NLTK**: Natural Language ToolKit to work with human language data

#### Using the Library

In order to use the functions (and objects) available in the libraries, we will need to include (or import) them.  In Python, you will see the following ways to import them.  
<span style="color:red">import numpy</span>  
If you import like above, you will need to specify the library and function, e.g., <span style="color:red">numpy.sin(3.1415926535/2)</span>.  You can also use the <span style="color:red">numpy</span> object representation for pi, e.g., <span style="color:red">numpy.sin(numpy.pi/2)</span>.  Typing numpy each time may be cumbersome so we can import it and use a different shortened reference.  
<span style="color:red">import numpy as np  
import pandas as pd</span>  
With this, we can just use <span style="color:red">np.sin(np.pi/2)</span>.  Some libraries are very large and we may not want to import the whole library.  Libraries such as <span style="color:red">numpy</span> take up about 50+ megabytes.  Taking <span style="color:red">matplotlib</span> as an example, we usually do not need the whole library but just the plotting functions for our course.  Hence, we can write  
<span style="color:red">from matplotlib import pyplot as plt</span>  
or  
<span style="color:red">import matplotlib.pyplot as plt</span>
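The import styles described above can be compared side by side. This is a minimal sketch, assuming NumPy is installed; the variable names `full`, `short`, and `direct` are only for illustration.

```python
import numpy                 # style 1: full library name
import numpy as np           # style 2: shortened alias
from numpy import pi, sin    # style 3: import only selected names

full = numpy.sin(numpy.pi / 2)   # library name typed out each time
short = np.sin(np.pi / 2)        # same call through the alias
direct = sin(pi / 2)             # names used without any prefix

print(full, short, direct)
```

All three lines call the same function, so the choice is about readability rather than behaviour.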

## Starting Data Science Projects

### Introduction

We will start with an introduction to the pandas DataFrame.  A DataFrame is a tabular (2-dimensional) data structure similar to an MS-Excel sheet, and it is widely used in data science projects.  Let's first create our own DataFrame, and call it "<span style="color:red">df</span>".

<span style="color:red">df = pd.DataFrame({</span>  
<p style="margin-left: 40px;"><span style="color:red">'StudentID' : [68822,68823,68844,68845,68846],</span></p>  
<p style="margin-left: 40px;"><span style="color:red">'FirstName' : ['Steven','Alex','Bill','Mark','Bob'],</span></p>  
<p style="margin-left: 40px;"><span style="color:red">'EnrolYear' : [2010,2010,2011,2011,2013],</span></p>  
<p style="margin-left: 40px;"><span style="color:red">'Math'      : [100,90,90,40,60]</span></p>  
<span style="color:red">})</span>  

Do note that the structure of it is defined by the <span style="color:red">{ }</span>, and inside the <span style="color:red">{ }</span>, we can specify many key-value pairs of <span style="color:red">'string':[item list]</span> where the items can be numeric or objects (in this case strings) and these key-value pairs are separated by a comma '<span style="color:red">,</span>'. Note that the last key-value pair does not need a comma at the end.  You can view the DataFrame by typing:  
<span style="color:red">df</span>  
Let's read from another source of data.  On our Canvas page, there should be a file called <span style="color:red">filename.csv</span>.  CSV stands for comma-separated values, where each field is separated by a comma.  You can open this file in a plain text editor, and its layout should be self-explanatory.  This is a format that most spreadsheets (such as MS-Excel or Google Sheets) can read and write.
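To see how the dictionary of lists described above maps onto a table, it can help to inspect the result right after creating it. The sketch below rebuilds the same `df` and prints its shape, column names, and column types; it assumes pandas is installed and imported as `pd`.

```python
import pandas as pd

# Each key becomes a column name; each list becomes that column's values.
df = pd.DataFrame({
    'StudentID': [68822, 68823, 68844, 68845, 68846],
    'FirstName': ['Steven', 'Alex', 'Bill', 'Mark', 'Bob'],
    'EnrolYear': [2010, 2010, 2011, 2011, 2013],
    'Math': [100, 90, 90, 40, 60],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names, in dictionary order
print(df.dtypes)         # the type pandas inferred for each column
```

All lists must have the same length, otherwise pandas raises an error when building the table.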

### Reading from a CSV file

We will use the <span style="color:red">pandas</span> library to read it.  <span style="color:red">pandas</span> can also read other formats (more on this later).  It will read it into a data structure called the <span style="color:red">DataFrame</span> which is not dissimilar to a spreadsheet table.  
<span style="color:red">import pandas as pd  
""" You need to download the file filename.csv to your Jupyter Notebook start folder."""  
data = pd.read_csv("filename.csv")</span>  
The first thing to do in all Data Science work is to inspect the data.  The <span style="color:red">DataFrame</span> data structure, which in this case has been stored in a variable named <span style="color:red">data</span>, has a few built-in functions (methods) that can help us.  We access these with the dot (.) reference, e.g.,  
<span style="color:red">data.head()</span>  
This will display by default the first 5 rows.  We will go into exploratory data analysis next week.
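If the file is not to hand, the same steps can be tried on CSV text held in memory. This is a sketch only: the text below is a small stand-in and not the actual contents of `filename.csv`, and `io.StringIO` is used so that `read_csv` can treat a string as if it were a file.

```python
import io
import pandas as pd

# Stand-in CSV text (an assumption for illustration, not the course file).
csv_text = """names,ages,heights
Bill,76,1.55
Ted,82,1.69
Henry,104,1.49
Joan,78,1.57
Ian,23,1.71
"""

data = pd.read_csv(io.StringIO(csv_text))  # first line becomes the header

print(data.shape)    # rows and columns that were read
print(data.head(3))  # head() accepts the number of rows to show
```

Passing a number to `head()` overrides the default of five rows.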

### Reading from a Uniform Resource Locator (URL)

<span style="color:red">pandas</span> also allows you to read directly from URLs.  A common example is reading a CSV file from a URL.  Let's get the Titanic dataset from a [GitHub repository](https://github.com/datasciencedojo/datasets/blob/master/titanic.csv).  Do note that if you try to read this page address directly, there will be an error, because it serves a web page (HTML) rather than the CSV file itself.  Hence, when reading from a repository such as GitHub or Kaggle ([www.kaggle.com](https://www.kaggle.com)), do get the "raw" format.  On the GitHub page, there is a "Raw" button located near the top right of the screen.  Click on that and then copy the URL.

![P1](picture/P1.png)

Use that URL and read it into a DataFrame.  You can call it "data", or "titanic".

<span style="color:red">url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"</span>  
<span style="color:red">titanic = pd.read_csv(url)  
titanic.tail()</span>  

The <span style="color:red">tail()</span> function returns the last 5 rows of the dataset.
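The difference between the page address and the "raw" address used above is a simple rewrite of the URL. The helper below is a hypothetical convenience function (`to_raw_url` is not part of pandas or GitHub tooling) that sketches this rewrite for the usual `github.com/<user>/<repo>/blob/<branch>/<path>` layout; clicking the "Raw" button remains the reliable way to get the address.

```python
def to_raw_url(blob_url: str) -> str:
    """Turn a github.com '/blob/' page URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError("not a GitHub blob URL")
    # Drop the host, then remove the first '/blob' path segment.
    path = blob_url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


page = "https://github.com/datasciencedojo/datasets/blob/master/titanic.csv"
raw = to_raw_url(page)
print(raw)
```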

You can also read HTML from URLs.  Let's try to use the BBC listing for the Scottish Premier League football table.

<span style="color:red">scot_football = pd.read_html('https://www.bbc.com/sport/football/scottish-premiership/table')</span>  
<span style="color:red">scot_football</span>  

You will notice that the output is not quite so pretty.  This is because <span style="color:red">pandas.read_html()</span> returns a list of DataFrames, one for each table it finds on the page.  It relies on the HTML containing some form of tabular layout that can be read into a DataFrame structure, hence not all HTML pages can be read.  To make it pretty, we select the first table with the index 0 (in most cases there will only be one table, but if there are more, you can access them using the appropriate index).

<span style="color:red">scot_football = pd.read_html('https://www.bbc.com/sport/football/scottish-premiership/table')[0]</span>  
<span style="color:red">scot_football</span>  

You will get the data in the DataFrame format which is tabular and easier for us to visualise.

### Reading from Relational Databases

More often than not, a data scientist will be required to extract data from relational databases.  In this semester, many of you will also be taking or have taken the course F28DM Database Management. As a data scientist, SQL will be one of your top 3 tools that you will frequently use.  There are many Python libraries for the various databases, and one of the most popular Python libraries is [SQLAlchemy](https://www.sqlalchemy.org).  Let's import this and set up a connection to the database.  For this, you will need to locate the "northwind.db" database which is in the SQLite format.

<span style="color:red">import sqlalchemy  
conn = sqlalchemy.create_engine('sqlite:///northwind.db') # locating the file may be a bit tricky for some  
data = pd.read_sql_query('SELECT * from Orders;', conn)  
data.head()</span>  

Your challenge may be in downloading and copying the database to the right location.  Note that the connection string has a triple /, which indicates a relative path: the file is looked up relative to the current working directory, which is normally the folder containing your <span style="color:red">.ipynb</span> file.  Do ask if you are unsure about any of this.
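If locating `northwind.db` is the sticking point, the query step can be practised without any file at all. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module instead of SQLAlchemy, an in-memory database, and a tiny three-column `Orders` table that only imitates the real one.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, so no file has to be located.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderID INTEGER, CustomerID TEXT, Freight REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(10248, "VINET", 16.75), (10249, "TOMSP", 22.25), (10250, "HANAR", 25.0)],
)
conn.commit()

# read_sql_query runs the SQL and returns the result as a DataFrame.
orders = pd.read_sql_query("SELECT * FROM Orders;", conn)
heavy = pd.read_sql_query(
    "SELECT OrderID, Freight FROM Orders WHERE Freight > 20;", conn
)
conn.close()

print(orders)
print(heavy)
```

The second query shows that filtering can be done in SQL before the data ever reaches pandas, which matters for large tables.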

### Explore Northwind and data.gov.uk

The above are some introductory examples of getting data from a file, from an online location (URL), and from a database (which you will more than likely be using when you work as a Data Analyst/Engineer/Scientist).  Do read up more about them, and try them out.  You can get information about the "Northwind" database, which is a fictitious database created by Microsoft for tutorial purposes and has since been used by others.  It contains fictitious data of a food import and export company called "Northwind Traders".  The data consists of sales data and is an excellent example schema for small-business Enterprise Resource Planning, with customers, orders, inventory, purchasing, suppliers, shipping, and employees tables/data.

Another source of data, a global initiative by the World Bank, is the [Open Data Toolkit Initiative](https://opendatatoolkit.worldbank.org/en/data/opendatatoolkit/home). Almost every country in the world has initiated this through a website.  Depending on your campus location, do have a look at:

* https://www.data.gov.uk/  
* https://data.gov.scot/  
* https://opendata.fcsc.gov.ae/search (currently may have some minor error loading)  
* https://data.gov.my/  

There may be data that is of interest to you, and feel free to plan a project using the data (maybe even equivalent data from more than one of these portals). If you are looking for a place to start, you may look at the Scottish Heating Demand:

<span style="color:red">scot_heating = pd.read_csv('https://heatmap.data.gov.scot/downloads/Settlements_Heat_Demand.csv')</span>

We have now looked at various sources of data. There are more possibilities and another common source is through what we call the Application Programming Interface (API) which we will look at towards the end of this course.  We now have a few DataFrames, and we will continue to work on using the df DataFrame.

### Handling DataFrames

You can print out the DataFrame to see what it looks like:

<span style="color:red">df</span>

#### Exercise 2.1: 

Create another data table called "df2" that contains the height of some of the students. It should contain the following data:

![P2](picture/P2.png)

**Column and Row selection:**  
There are a few ways to select columns and rows.  Here we select a column using its name and the bracket syntax:

<span style="color:red">df['FirstName']</span>

Or using the somewhat simpler dot notation:

<span style="color:red">df.FirstName</span>

We can also select only elements that fit certain conditions. What does the following command produce?

<span style="color:red">df[df['FirstName'] == 'Alex']</span>

Why does <span style="color:red">"df"</span> appear twice in the command? If you are unsure, try without the outer <span style="color:red">df [ ]</span>, e.g.,  <span style="color:red">df['FirstName'] == 'Bob'</span>.
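One way to answer the question above is to look at the two steps separately. In this sketch (which rebuilds the `df` from the notes), the inner expression produces a column of True/False values, and the outer `df[...]` keeps only the rows marked True; the name `mask` is just a label chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentID': [68822, 68823, 68844, 68845, 68846],
    'FirstName': ['Steven', 'Alex', 'Bill', 'Mark', 'Bob'],
    'EnrolYear': [2010, 2010, 2011, 2011, 2013],
    'Math': [100, 90, 90, 40, 60],
})

# Step 1: the comparison alone gives one True/False value per row.
mask = df['FirstName'] == 'Alex'
print(mask)

# Step 2: indexing df with that mask keeps only the True rows.
print(df[mask])
```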

#### Exercise 2.2: 

Show the details of the students who enrolled in the year 2011.

We can select rows with ever more complicated conditions, for example:

<span style="color:red">""" select rows where first-name isn't 'Bob' (!= is not equal) and student ID is larger than 68823"""  
filt = (df.FirstName != 'Bob') & (df.StudentID > 68823)  
df[filt]</span>  

Before you continue to the next section, explore how you can select different columns and rows of the DataFrame that you have created above.
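The parentheses around each condition in the filter above are not optional. In Python, `&` binds more tightly than comparisons such as `>` and `==`, so leaving them out changes how the expression is read. The sketch below, using a different pair of conditions chosen for illustration, shows the working form and then what happens without the parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentID': [68822, 68823, 68844, 68845, 68846],
    'FirstName': ['Steven', 'Alex', 'Bill', 'Mark', 'Bob'],
    'EnrolYear': [2010, 2010, 2011, 2011, 2013],
    'Math': [100, 90, 90, 40, 60],
})

# Correct: each comparison is wrapped in its own parentheses.
filt = (df['Math'] > 80) & (df['EnrolYear'] == 2010)
print(df[filt])

# Without parentheses Python evaluates 80 & df['EnrolYear'] first and then
# chains the comparisons, which pandas rejects with a ValueError.
try:
    bad = df['Math'] > 80 & df['EnrolYear'] == 2010
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```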

#### Exercise 2.3: 

Read the file "<span style="color:red">tutorial2.csv</span>" into a DataFrame called <span style="color:red">df3</span> .  The CSV file can be downloaded from the Canvas page.  
Hint: <span style="color:red">df3 = pd.read_csv("some file name")</span> 

**Modifying the structure of a table**

It is possible to modify the structure of DataFrames in Python. For example, we might want to add a new column containing the total of the "<span style="color:red">Math</span>" and "<span style="color:red">English</span>" marks for the students. To do this we can simply write:

<span style="color:red">df['Total'] = df['Math'] + df3['English']</span> 

The code above will create a new column called “<span style="color:red">Total</span> ” and populate it with the sum from the two columns “<span style="color:red">Math</span> ” and “<span style="color:red">English</span> ”. Print out the table to see the result.

<span style="color:red">df</span> 

Do note that the DataFrame df does not have the column '<span style="color:red">English</span> '.  We can simply create it:

<span style="color:red">df['English'] = df3['English']</span> 

This is actually risky: pandas lines the values up by row index, so nothing checks that the marks really belong to the same student.  We will look at <span style="color:red">merge()</span> in a short while to ensure that this is the case, but meanwhile do keep this in mind.
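The risk becomes visible as soon as the two tables list the students in a different order. In this sketch, `df3_reversed` is a hypothetical version of the English marks with the rows reversed; copying the column across gives some students the wrong mark, while matching on `StudentID` with `merge()` (covered later in these notes) does not.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentID': [68822, 68823, 68844, 68845, 68846],
    'FirstName': ['Steven', 'Alex', 'Bill', 'Mark', 'Bob'],
    'Math': [100, 90, 90, 40, 60],
})

# Hypothetical English marks, same students but listed in reverse order.
df3_reversed = pd.DataFrame({
    'StudentID': [68846, 68845, 68844, 68823, 68822],
    'English': [60, 80, 80, 70, 60],
})

# Naive copy: values are lined up by row index, not by StudentID.
naive = df.copy()
naive['English'] = df3_reversed['English']

# Matching on the key column pairs each mark with the right student.
matched = pd.merge(df, df3_reversed, on='StudentID')

print(naive[['FirstName', 'English']])
print(matched[['FirstName', 'English']])
```

In the naive result Alex and Mark end up with each other's marks, and no error or warning is raised.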

#### Exercise 2.4:

Add a new column showing the average mark of "<span style="color:red">Math</span>" and "<span style="color:red">English</span>" for each student. (Average = Total/number of subjects).  Do take note of the resulting data type.

#### Exercise 2.5:

Write a filter to select students who have scored more than 150 in total and who have achieved a subject score (in Math or English) of more than 90. (Hint: be careful with the parentheses.)  This should have only one result.

Oftentimes, we need to modify the layout of a table so that the data is in the right format for further processing or visualisation. In our current table we have two columns "<span style="color:red">Math</span>" and "<span style="color:red">English</span>", both of which contain student marks. We can split the information for each student into different rows using the "<span style="color:red">melt</span>" command, as follows:

<span style="color:red">""" turn column names 'Math' and 'English' into values for a new column 'Subject'"""  
df = pd.melt(</span>  
<p style="margin-left: 40px;"><span style="color:red">df,</span></p>  
<p style="margin-left: 40px;"><span style="color:red">id_vars=['EnrolYear','FirstName','StudentID'],</span></p>  
<p style="margin-left: 40px;"><span style="color:red">value_vars=['Math','English'],</span></p>  
<p style="margin-left: 40px;"><span style="color:red">var_name='Subject')</span></p>  

Print out the new table.

<span style="color:red">df</span>

![P3](picture/P3.png)

What have we done here? We have taken each record (row) from the original table, and turned it into two rows. What does the new column "<span style="color:red">Subject</span>" contain?  Do refer to the [pandas documentation on melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) for further information about the parameters "<span style="color:red">id_vars</span>", "<span style="color:red">value_vars</span>" and "<span style="color:red">var_name</span>".
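The effect of each parameter is easier to see on a table small enough to check by hand. The sketch below uses only two students and also passes `value_name`, a `melt` parameter not used in the notes, which names the value column directly.

```python
import pandas as pd

wide = pd.DataFrame({
    'StudentID': [68822, 68823],
    'FirstName': ['Steven', 'Alex'],
    'Math': [100, 90],
    'English': [60, 70],
})

long_df = pd.melt(
    wide,
    id_vars=['StudentID', 'FirstName'],  # columns repeated on every new row
    value_vars=['Math', 'English'],      # columns whose names become values
    var_name='Subject',                  # new column holding those names
    value_name='Score',                  # new column holding the marks
)

print(wide.shape, long_df.shape)
print(long_df)
```

Two students with two subjects each give four rows: every row of the wide table appears once per melted column.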

**Renaming columns**

The values that were in the columns Math and English now appear in a different column "<span style="color:red">value</span>". We can rename this new column to make the new table more understandable:  
<span style="color:red">df.rename(columns = {'value':'Score'}, inplace = True)</span>

**Merging DataFrames**

Remember that earlier, we simply added a column '<span style="color:red">English</span>' to the <span style="color:red">df</span> DataFrame but the order of the marks may not be the same and also the number of marks may not be of the same length.  Let's suppose that we reset our <span style="color:red">df</span> DataFrame to what it was - note that by this stage, <span style="color:red">df</span> has already been modified many times.

<span style="color:red">df = pd.DataFrame({  
  'StudentID' : [68822,68823,68844,68845,68846],  
  'FirstName' : ['Steven','Alex','Bill','Mark','Bob'],  
  'EnrolYear' : [2010,2010,2011,2011,2013],  
  'Math'      : [100,90,90,40,60]  
})</span>  

And we read the CSV file with the English marks from "tutorial2.csv" as per Exercise 2.3 (please re-read it into <span style="color:red">df3</span>).  We can then use the pandas function <span style="color:red">merge()</span> to ensure that the right student marks are combined correctly.

<span style="color:red">df4 = pd.merge(df, df3, on=['StudentID','FirstName'])</span>

What difference can you see after merging the two datasets?  For those of you who are doing F28DM, this is similar to the JOIN statement in SQL.  Do explore the pandas documentation on merge to learn about the "<span style="color:red">on</span>" parameter.  In the resulting DataFrame, is there a student missing?

In the example provided, there are no students missing as both DataFrames have the same set of students.  You can modify either <span style="color:red">df</span> or the <span style="color:red">tutorial2.csv</span> file to include some new names.  Do the <span style="color:red">merge</span> again: which parameter do you think you need in order to include all students?
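One way to explore this is with two small tables that only partly overlap. In the sketch below the student 68999 and the mark 55 are made up for illustration; the `how` parameter of `merge()` then decides which unmatched rows are kept.

```python
import pandas as pd

math_marks = pd.DataFrame({
    'StudentID': [68822, 68823, 68844],
    'Math': [100, 90, 90],
})

# 68822 has no English mark here; 68999 is a made-up extra student.
english_marks = pd.DataFrame({
    'StudentID': [68823, 68844, 68999],
    'English': [70, 80, 55],
})

inner = pd.merge(math_marks, english_marks, on='StudentID')               # default
left = pd.merge(math_marks, english_marks, on='StudentID', how='left')
outer = pd.merge(math_marks, english_marks, on='StudentID', how='outer')

print(inner)  # only students present in both tables
print(left)   # every student from the first table
print(outer)  # every student from either table
```

Missing marks appear as `NaN`, which is also why a column of whole numbers may turn into floating-point values after such a merge.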

This tutorial has left some things for you to ponder.  It has covered DataFrames, reading from common sources (except APIs, which we will cover later), managing DataFrames, and integrating data from different sources.  In a typical real-life situation, the key columns in the two tables will not share the same names; the <span style="color:red">merge()</span> function handles this through its '<span style="color:red">left_on</span>' and '<span style="color:red">right_on</span>' parameters, which are used in place of '<span style="color:red">on</span>'.  Have fun, there is lots to explore before we continue to work on the data next week.
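When the key columns are named differently in the two tables, `merge()` accepts `left_on` and `right_on` in place of `on`. A minimal sketch, where the column name `student_no` is a hypothetical example of a differently named key:

```python
import pandas as pd

marks = pd.DataFrame({
    'StudentID': [68822, 68823],
    'Math': [100, 90],
})

# Same students, but the key column has a different (hypothetical) name.
english = pd.DataFrame({
    'student_no': [68823, 68822],
    'English': [70, 60],
})

merged = pd.merge(marks, english, left_on='StudentID', right_on='student_no')
print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is not needed.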

## My code part

### Import the library

From the command line, install each library with "py -m pip install XX", where XX is the name of the library.

### Using the Library

In [24]:
import numpy as np
import matplotlib as mpl
import pandas as pd
import sklearn as skl

In [29]:
np.sin(np.pi/2)

np.float64(1.0)

### Starting Data Science Projects

In [30]:
df = pd.DataFrame({
  'StudentID' : [68822,68823,68844,68845,68846],
  'FirstName' : ['Steven','Alex','Bill','Mark','Bob'],
  'EnrolYear' : [2010,2010,2011,2011,2013],
  'Math'      : [100,90,90,40,60]
})

### Reading from a CSV file

In [4]:
# You need to download the file filename.csv to your Jupyter Notebook start folder.
data = pd.read_csv("data/filename.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,names,ages,heights
0,1,Bill,76,1.55
1,2,Ted,82,1.69
2,3,Henry,104,1.49
3,4,Joan,78,1.57
4,5,Ian,23,1.71


### Reading from a Uniform Resource Locator (URL)

In [5]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)
titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [6]:
scot_football = pd.read_html('https://www.bbc.com/sport/football/scottish-premiership/table')
scot_football

[    Position           Team  Played  Won  Drawn  Lost  Goals For  \
 0          1         Celtic      29   24      3     2         87   
 1          2        Rangers      29   18      5     6         59   
 2          3      Hibernian      29   11     10     8         44   
 3          4       Aberdeen      29   12      6    11         38   
 4          5  Dundee United      29   11      8    10         36   
 5          6     Motherwell      29   11      4    14         34   
 6          7         Hearts      29   10      6    13         41   
 7          8    Ross County      29    9      8    12         31   
 8          9     St. Mirren      29   10      4    15         35   
 9         10     Kilmarnock      29    8      7    14         33   
 10        11         Dundee      29    7      7    15         41   
 11        12  St. Johnstone      29    7      4    18         32   
 
     Goals Against  Goal Difference  Points  \
 0              17               70      75   
 1     

In [7]:
scot_football = pd.read_html('https://www.bbc.com/sport/football/scottish-premiership/table')[0]
scot_football

Unnamed: 0,Position,Team,Played,Won,Drawn,Lost,Goals For,Goals Against,Goal Difference,Points,"Form, Last 6 games, Oldest first"
0,1,Celtic,29,24,3,2,87,17,70,75,WResult WinWResult WinWResult WinLResult LossW...
1,2,Rangers,29,18,5,6,59,26,33,59,WResult WinWResult WinWResult WinLResult LossW...
2,3,Hibernian,29,11,10,8,44,40,4,43,DResult DrawWResult WinDResult DrawWResult Win...
3,4,Aberdeen,29,12,6,11,38,46,-8,42,LResult LossLResult LossWResult WinWResult Win...
4,5,Dundee United,29,11,8,10,36,36,0,41,LResult LossLResult LossLResult LossWResult Wi...
5,6,Motherwell,29,11,4,14,34,48,-14,37,LResult LossLResult LossLResult LossLResult Lo...
6,7,Hearts,29,10,6,13,41,40,1,36,WResult WinWResult WinLResult LossWResult WinW...
7,8,Ross County,29,9,8,12,31,49,-18,35,DResult DrawLResult LossWResult WinWResult Win...
8,9,St. Mirren,29,10,4,15,35,47,-12,34,WResult WinLResult LossDResult DrawWResult Win...
9,10,Kilmarnock,29,8,7,14,33,47,-14,31,LResult LossWResult WinWResult WinLResult Loss...


### Reading from Relational Databases

In [9]:
import sqlalchemy
conn = sqlalchemy.create_engine('sqlite:///data/northwind.db') # locating the file may be a bit tricky for some
data = pd.read_sql_query('SELECT * from Orders;', conn)
data.head()

Unnamed: 0,OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
0,10248,VINET,5,2016-07-04,2016-08-01,2016-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l-Abbaye,Reims,Western Europe,51100,France
1,10249,TOMSP,6,2016-07-05,2016-08-16,2016-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany
2,10250,HANAR,4,2016-07-08,2016-08-05,2016-07-12,2,25.0,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,South America,05454-876,Brazil
3,10251,VICTE,3,2016-07-08,2016-08-05,2016-07-15,1,20.25,Victuailles en stock,"2, rue du Commerce",Lyon,Western Europe,69004,France
4,10252,SUPRD,4,2016-07-09,2016-08-06,2016-07-11,2,36.25,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,Western Europe,B-6000,Belgium


### Explore Northwind and data.gov.uk

In [10]:
scot_heating = pd.read_csv('https://heatmap.data.gov.scot/downloads/Settlements_Heat_Demand.csv')
scot_heating.head()

Unnamed: 0,SettlementName,SettlementCode,Area_m2,TotalHeatDemand_kwh_y,DemandDensity_kwh_y_m2
0,Creetown,S20001656,331816.7,7926779,23.889028
1,Hillhead,S20001780,229521.0,3984648,17.360716
2,Dumfries,S20001693,13359000.0,402384558,30.120865
3,Lockerbie,S20001864,2265415.0,69592642,30.719597
4,Dailly,S20001670,282831.8,7618381,26.936086


### Handling DataFrames

Exercise 2.1

In [11]:
df2 = pd.DataFrame({
  'Height' : [160,185,178,189],
  'Student': [68822,68823,68844,68845]
})
print (df2)

   Height  Student
0     160    68822
1     185    68823
2     178    68844
3     189    68845


In [12]:
print (df2['Height'])
print (df2.Height)
print(df2[df2['Height'] > 180])

0    160
1    185
2    178
3    189
Name: Height, dtype: int64
0    160
1    185
2    178
3    189
Name: Height, dtype: int64
   Height  Student
1     185    68823
3     189    68845


Exercise 2.2

In [13]:
# select rows where first-name isn't 'Bob' (!= is not equal)
# and student ID is larger than 68823
filt = (df.FirstName != 'Bob') & (df.StudentID > 68823)
df[filt]

Unnamed: 0,StudentID,FirstName,EnrolYear,Math
2,68844,Bill,2011,90
3,68845,Mark,2011,40
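The exercise itself asks for the students who enrolled in 2011. A minimal sketch of one possible answer, assuming the same `df` as in the notes, is a single condition on `EnrolYear`:

```python
import pandas as pd

df = pd.DataFrame({
    'StudentID': [68822, 68823, 68844, 68845, 68846],
    'FirstName': ['Steven', 'Alex', 'Bill', 'Mark', 'Bob'],
    'EnrolYear': [2010, 2010, 2011, 2011, 2013],
    'Math': [100, 90, 90, 40, 60],
})

# Keep only the rows whose EnrolYear equals 2011.
enrolled_2011 = df[df['EnrolYear'] == 2011]
print(enrolled_2011)
```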


Exercise 2.3

In [14]:
df3 = pd.read_csv("data/tutorial2.csv")
print(df3)

   StudentID FirstName  English
0      68822    Steven       60
1      68823      Alex       70
2      68844      Bill       80
3      68845      Mark       80
4      68846       Bob       60


In [15]:
df['Total'] = df['Math'] + df3['English']
print(df['Total'])
df['English'] = df3['English']
print(df)

0    160
1    160
2    170
3    120
4    120
Name: Total, dtype: int64
   StudentID FirstName  EnrolYear  Math  Total  English
0      68822    Steven       2010   100    160       60
1      68823      Alex       2010    90    160       70
2      68844      Bill       2011    90    170       80
3      68845      Mark       2011    40    120       80
4      68846       Bob       2013    60    120       60


Exercise 2.4

In [16]:
df['Mean'] = df[['Math','English']].mean(axis=1)
print(df)

   StudentID FirstName  EnrolYear  Math  Total  English  Mean
0      68822    Steven       2010   100    160       60  80.0
1      68823      Alex       2010    90    160       70  80.0
2      68844      Bill       2011    90    170       80  85.0
3      68845      Mark       2011    40    120       80  60.0
4      68846       Bob       2013    60    120       60  60.0


Exercise 2.5

In [17]:
filt = (df['Math'] > 90) & (df['Total'] > 150) | (df['English'] > 90) & (df['Total'] > 150)
filt2 = ((df['Math'] > 90) | (df['English'] > 90)) & (df['Total'] > 150)
filtered_df = df[filt]
filtered_df2 = df[filt2]

print(filtered_df)
print(filtered_df2)
df

   StudentID FirstName  EnrolYear  Math  Total  English  Mean
0      68822    Steven       2010   100    160       60  80.0
   StudentID FirstName  EnrolYear  Math  Total  English  Mean
0      68822    Steven       2010   100    160       60  80.0


Unnamed: 0,StudentID,FirstName,EnrolYear,Math,Total,English,Mean
0,68822,Steven,2010,100,160,60,80.0
1,68823,Alex,2010,90,160,70,80.0
2,68844,Bill,2011,90,170,80,85.0
3,68845,Mark,2011,40,120,80,60.0
4,68846,Bob,2013,60,120,60,60.0


In [18]:
# turn column names 'Math' and 'English' into values for a new column 'Subject'
df = pd.melt(
  df,
  id_vars=['EnrolYear','FirstName','StudentID'],
  value_vars=['Math','English'],
  var_name='Subject')

In [19]:
df

Unnamed: 0,EnrolYear,FirstName,StudentID,Subject,value
0,2010,Steven,68822,Math,100
1,2010,Alex,68823,Math,90
2,2011,Bill,68844,Math,90
3,2011,Mark,68845,Math,40
4,2013,Bob,68846,Math,60
5,2010,Steven,68822,English,60
6,2010,Alex,68823,English,70
7,2011,Bill,68844,English,80
8,2011,Mark,68845,English,80
9,2013,Bob,68846,English,60


In [20]:
df.rename(columns = {'value':'Score'}, inplace = True)
df

Unnamed: 0,EnrolYear,FirstName,StudentID,Subject,Score
0,2010,Steven,68822,Math,100
1,2010,Alex,68823,Math,90
2,2011,Bill,68844,Math,90
3,2011,Mark,68845,Math,40
4,2013,Bob,68846,Math,60
5,2010,Steven,68822,English,60
6,2010,Alex,68823,English,70
7,2011,Bill,68844,English,80
8,2011,Mark,68845,English,80
9,2013,Bob,68846,English,60


In [21]:
df = pd.DataFrame({
  'StudentID' : [68822,68823,68844,68845,68846],
  'FirstName' : ['Steven','Alex','Bill','Mark','Bob'],
  'EnrolYear' : [2010,2010,2011,2011,2013],
  'Math'      : [100,90,90,40,60]
})
df

Unnamed: 0,StudentID,FirstName,EnrolYear,Math
0,68822,Steven,2010,100
1,68823,Alex,2010,90
2,68844,Bill,2011,90
3,68845,Mark,2011,40
4,68846,Bob,2013,60


In [22]:
df4 = pd.merge(df, df3, on=['StudentID','FirstName'])
df4

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English
0,68822,Steven,2010,100,60
1,68823,Alex,2010,90,70
2,68844,Bill,2011,90,80
3,68845,Mark,2011,40,80
4,68846,Bob,2013,60,60
