## Exploratory Data Analysis of Clarendon Scholars



**Collecting data on all previous scholars and analyzing it for insights**


### Context

In the few years since completing my undergraduate studies, I have continually searched for scholarships and funding opportunities to further my studies, yet I had never heard of this particular one. A third-party connection posted it on LinkedIn, and it piqued my curiosity. I did some digging on the scholarship website and found some helpful information, along with the list of previous scholars. I became even more curious to learn how many Nigerians, or Africans in general, have enjoyed the scholarship since it began. It was a good opportunity to apply my programming skills to answer these questions, hence this project.


Clarendon not only offers over 150 new, fully-funded scholarships each year to assist outstanding graduate scholars, but offers the opportunity to join one of the most active, highly international, and multidisciplinary communities at Oxford.

Originally established to support overseas students, the Clarendon Fund first welcomed scholars to Oxford in 2001. The scheme was expanded in 2012 to include students from the UK and EU, thereby providing funding for all fee statuses. Throughout this period, the Fund’s aim has remained unchanged: to assist academically outstanding graduate students through their studies at the University of Oxford.




3. Analysis of the data
   * Determine the top ten countries with the most scholars
   * Determine the top ten colleges with the most scholars
   * Determine the top ten courses with the most scholars
   * Plot visuals of the top countries, colleges and courses
   * Determine the number of African scholars
   * Determine the number of Nigerian scholars


### Getting the Data
You can find the data used for this analysis [here](https://www.ox.ac.uk/clarendon/scholar-class-lists/scholars-2020-21).

In [2]:
import pandas as pd
import numpy as np
import requests
import seaborn as sns
import plotly.express as px
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"}


In [3]:
url = "https://www.ox.ac.uk/clarendon/scholar-class-lists"

In [4]:
html_doc = requests.get(url, headers=headers).text
clarendon_soup = BeautifulSoup(html_doc, "html5lib")


In [5]:
# Get the navigation block that links to every class-list page
page_content = clarendon_soup.find("nav", id="block-menu-block-10")


In [6]:
# Grab the list items that link to each class-list page
all_pages = page_content.find_all("li")
all_pages

[<li class="first leaf menu-mlid-8570"><a href="/clarendon/scholar-class-lists/scholars-2020-21">Scholars 2020-21</a></li>,
 <li class="leaf menu-mlid-8573"><a href="/clarendon/scholar-class-lists/scholars-2019-20">Scholars 2019-20</a></li>,
 <li class="leaf menu-mlid-9405"><a href="/clarendon/scholar-class-lists/scholars-2018-19">Scholars 2018-19</a></li>,
 <li class="leaf menu-mlid-8883"><a href="/clarendon/scholar-class-lists/scholars-2017-18">Scholars 2017-18</a></li>,
 <li class="leaf menu-mlid-5679"><a href="/clarendon/scholar-class-lists/scholars-2016-17">Scholars 2016-17</a></li>,
 <li class="last leaf menu-mlid-8582"><a href="/clarendon/scholar-class-lists/previous-scholars">Previous scholars</a></li>]

In [7]:
# Extract all the scholars from each class-list page into a single dataframe
def generate_df(pages):
    rows = []
    for page_link in pages:
        # Build the page slug from the link text, e.g. "Scholars 2020-21" -> "Scholars-2020-21"
        query = "-".join(page_link.string.split(" "))
        page_url = f"https://www.ox.ac.uk/clarendon/scholar-class-lists/{query}"

        page = requests.get(page_url, headers=headers).text
        soup = BeautifulSoup(page, "html5lib")

        # Each table row holds one scholar: name, country, course, college
        for row in soup.find("tbody").find_all("tr"):
            col = row.find_all("td")
            rows.append({
                "Name": col[0].text,
                "Country": col[1].text,
                "Course": col[2].text,
                "College": col[3].text,
            })

    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    clarendon_df = pd.DataFrame(rows, columns=["Name", "Country", "Course", "College"])
    return clarendon_df
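
The function above rebuilds each page URL from the link text. The `href` values visible in the `all_pages` output already contain the site-relative paths, so another option is to resolve those against the base URL. Here is a minimal sketch of that idea, using only the standard library. The helper name `build_page_urls` is my own, and the two `href` strings are copied from the output above rather than scraped live.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ox.ac.uk/clarendon/scholar-class-lists"

# href values copied from the all_pages output (not fetched here)
hrefs = [
    "/clarendon/scholar-class-lists/scholars-2020-21",
    "/clarendon/scholar-class-lists/previous-scholars",
]


def build_page_urls(base_url, hrefs):
    """Resolve each site-relative href against the base URL (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


page_urls = build_page_urls(BASE_URL, hrefs)
for page_url in page_urls:
    print(page_url)
```

With the real soup objects, the `href` of each list item could presumably be read with something like `item.find("a")["href"]`, which avoids depending on how the link text happens to be capitalised.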

In [8]:
scholars = generate_df(all_pages)

In [10]:
scholars.head()

| | Name | Country | Course | College |
|---|---|---|---|---|
| 0 | Abdul Rad | United States of America | DPhil in Sociology (PT) | Nuffield College |
| 1 | Abheek Ghosh | India | DPhil in Computer Science | Exeter College |
| 2 | Abhishek Ranjan Datta | India | DPhil in International Development | Lincoln College |
| 3 | Ahmed Tohamy | Egypt | DPhil in Economics | Nuffield College |
| 4 | Ajantha Abey | Australia | DPhil in Physiology, Anatomy and Genetics | Keble College |


In [11]:
# Drop the name column for privacy reasons.
scholars_df = scholars.drop(columns="Name")

In [12]:
# Save the clarendon/scholar-class-lists to local storage
scholars_df.to_excel("data/clarendon-scholars.xlsx", index=False)
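
One practical note on saving: as far as I know, pandas does not create missing folders, so the write fails if `data/` does not exist yet. The sketch below creates the folder first. It writes CSV instead of Excel purely so that it runs without an Excel engine installed, which is my substitution and not what the notebook does. The helper name `save_scholars` and the two sample rows are illustrative, and a temporary folder stands in for the project directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_scholars(df, path):
    """Create the parent folder if needed, then write the frame as CSV (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Two sample rows taken from the table above, with the Name column already dropped
df = pd.DataFrame({
    "Country": ["India", "Egypt"],
    "Course": ["DPhil in Computer Science", "DPhil in Economics"],
    "College": ["Exeter College", "Nuffield College"],
})

# A temporary directory stands in for the project folder in this sketch
with tempfile.TemporaryDirectory() as tmp:
    out = save_scholars(df, Path(tmp) / "data" / "clarendon-scholars.csv")
    reloaded = pd.read_csv(out)

print(reloaded)
```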


### Analysis of the Data

We will try to understand the data and answer the project questions by placing it in a visual context, so that patterns, trends and correlations that might otherwise go undetected can be exposed. Although `seaborn` is imported above, the charts below are drawn with `plotly.express`, which provides a high-level interface for creating attractive, interactive graphs.

In [81]:
# Load the Clarendon scholars data from local storage
scholar_data = pd.read_excel("data/clarendon-scholars.xlsx")

# Understanding the basic information of my data
def all_about_my_data(df):
    print("Here are some basic ground information about my data:\n")
    
    # Shape of the dataframe
    print("Number of Instances:", df.shape[0])
    print("Number of Features:", df.shape[1])
    
    # Summary of stat
    print("\nSummary Stats:")
    print(df.describe())
    
    # Missing value inspection
    print("\nMissing Values:")
    print(df.isna().sum())

    
all_about_my_data(scholar_data)

Here are some basic ground information about my data:

Number of Instances: 784
Number of Features: 3

Summary Stats:
               Country                      Course          College
count              784                         784              784
unique              82                         228               50
top     United Kingdom  DPhil in Clinical Medicine  Balliol College
freq               156                          25               52

Missing Values:
Country    0
Course     0
College    0
dtype: int64


In [82]:
scholar_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 784 entries, 0 to 783
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  784 non-null    object
 1   Course   784 non-null    object
 2   College  784 non-null    object
dtypes: object(3)
memory usage: 18.5+ KB


In [83]:
len(scholar_data["Country"].unique())

82

Everything looks fine, and it appears we can proceed with getting insights from our data.

**BUT WAIT!**

The names of some countries in the `Country` column appear to be misspelled, for example `United Kingdon` instead of `United Kingdom`. There is also the conundrum of `England` and `United Kingdom` being recorded as separate countries, and one instance of `Country` encoded as `\xa0` (a non-breaking space).
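
Issues like these are easy to miss by eye in a column with dozens of unique values. Below is a rough sketch of one way to surface them: flag pairs of labels that are nearly identical, and find entries that are only whitespace. The sample values are made up to mimic the problems described above, and `find_near_duplicates` with its `cutoff` of 0.9 is my own choice, not part of the notebook.

```python
import difflib

import pandas as pd

# Hypothetical sample that mimics the problems described above
countries = pd.Series(
    ["United Kingdom", "United Kingdon", "England", "India", "\xa0", "India"]
)


def find_near_duplicates(values, cutoff=0.9):
    """Return pairs of distinct labels whose similarity ratio is at least `cutoff`."""
    labels = sorted(set(values))
    pairs = []
    for i, label in enumerate(labels):
        # Only compare against later labels so each pair is reported once
        matches = difflib.get_close_matches(label, labels[i + 1:], n=3, cutoff=cutoff)
        pairs.extend((label, match) for match in matches)
    return pairs


near_duplicates = find_near_duplicates(countries)

# Entries that are empty once whitespace (including non-breaking spaces) is stripped
blank_entries = countries[countries.str.strip() == ""]

print(near_duplicates)
print(len(blank_entries))
```

A check like this would catch `United Kingdon`, but not `England` versus `United Kingdom`. That kind of duplicate still needs domain knowledge.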

In [84]:
#@title Correct spelling mistakes, merge repeated countries, replace \xa0 with "Missing"
# regex=False treats each pattern as literal text, so the parentheses in
# "Korea, Republic of (South)" are not read as a regex group
scholar_data["Country"] = scholar_data["Country"]\
.str.replace("UnitedStatesofAmerica", "United States of America", regex=False)\
.str.replace("England", "United Kingdom", regex=False)\
.str.replace("Korea, Republic of (South)", "South Korea", regex=False)\
.str.replace("United Kingdon", "United Kingdom", regex=False)\
.str.replace("\xa0", "Missing", regex=False)
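
One thing to watch with `str.replace`: depending on the pandas version, the pattern may be treated as a regular expression by default (I believe the default changed to literal matching in pandas 2.0). That matters for names containing parentheses. The small sketch below, on a made-up two-item Series, sets `regex` explicitly both ways to show the difference.

```python
import pandas as pd

names = pd.Series(["Korea, Republic of (South)", "Hong Kong (SAR)"])

# As a regex, "(South)" is a capture group, so the pattern looks for the text
# "Korea, Republic of South" (no parentheses) and finds nothing to replace.
as_regex = names.str.replace("Korea, Republic of (South)", "South Korea", regex=True)

# As a literal string, the parentheses are matched character for character.
as_literal = names.str.replace("Korea, Republic of (South)", "South Korea", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```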




In [85]:
def replace_duplicate(x):
    if(x == "Hong Kong (SAR)"):
        return "Hong Kong"
    elif (x == "Russia (Russian Federation)"):
        return "Russia"
    elif (x == "Korea, Republic of (South)"):
        return "South Korea"
    else:
        return x

In [86]:

scholar_data["Country"] =  scholar_data["Country"].apply(replace_duplicate)
len(scholar_data["Country"].unique())

77
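
The `if`/`elif` chain in `replace_duplicate` can also be written as a dictionary passed to `Series.replace`, which swaps exact matches only. This is just a sketch on a made-up sample. The name `COUNTRY_ALIASES` is mine, while the three mappings are the same ones used above.

```python
import pandas as pd

# Same three mappings as replace_duplicate, expressed as data
COUNTRY_ALIASES = {
    "Hong Kong (SAR)": "Hong Kong",
    "Russia (Russian Federation)": "Russia",
    "Korea, Republic of (South)": "South Korea",
}

# Hypothetical sample values
sample = pd.Series(
    ["Hong Kong (SAR)", "Russia (Russian Federation)", "Russia", "India"]
)

# Series.replace with a dict swaps whole values that match a key exactly
cleaned = sample.replace(COUNTRY_ALIASES)

# nunique() is a shorter way to write len(series.unique())
n_countries = cleaned.nunique()

print(cleaned.tolist())
print(n_countries)
```

Keeping the aliases in a dictionary makes it easy to add another name later without touching the logic.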

### The previous Scholars - Country, Numbers

In [94]:
scholar_by_country = scholar_data["Country"].value_counts()[:10]

In [93]:

def create_barchart(df, ylabel, title):
    # df is a Series of counts (e.g. from value_counts()): labels in the index, counts in the values
    fig = px.bar(df, x=df.values, y=df.index, text=df.values, orientation="h")
    fig.update_traces(textposition="inside")
    fig.update_layout(uniformtext_minsize=8, uniformtext_mode="hide")
    fig.update_layout(
        title_text=title,
        xaxis={"title": "Number of scholars", "showticklabels": False, "showgrid": False},
        yaxis_title=ylabel,
        yaxis={"categoryorder": "total ascending"},
        template="plotly_white",
    )

    fig.show()

create_barchart(scholar_by_country, "Country", "Clarendon Scholars: Top Ten Countries")
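
`create_barchart` receives the raw `value_counts()` Series. If named columns are preferred for plotting (they can be referred to by name for axes and hover labels), the counts can first be turned into a small tidy dataframe. The sketch below shows only that conversion, on made-up values. The column name `Scholars` is my own choice.

```python
import pandas as pd

# Hypothetical sample of countries
countries = pd.Series(["India", "India", "Egypt", "India", "Egypt", "Kenya"])

counts = countries.value_counts()

# Name the index, then move it into a regular column next to the counts
tidy = counts.rename_axis("Country").reset_index(name="Scholars")

print(tidy)
```

A frame like `tidy` could then be passed to a plotting call with `x="Scholars"` and `y="Country"`.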

* The United Kingdom has the highest number of previous scholars, with 167.

Is this an indication that citizens of the home country stand a higher chance of being granted the scholarship than citizens of other countries?

**OR**

Was this just coincidence?
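
One number that helps frame the question is the UK's share of all scholars in this dataset. The quick calculation below uses the two figures already reported in this notebook (167 UK scholars out of 784 rows). A share on its own cannot say whether UK applicants are favoured, because the number of applicants per country is not part of this data.

```python
# Figures taken from the outputs above
uk_scholars = 167
total_scholars = 784

uk_share = uk_scholars / total_scholars * 100
print(f"UK share of scholars: {uk_share:.1f}%")
```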

### The previous Scholars - College, Numbers

In [105]:
scholars_by_college = scholar_data["College"].value_counts()[:10]
create_barchart(scholars_by_college, "Colleges", "Clarendon Scholars: Top Ten Colleges")

* Balliol College has the highest number of previous scholars, with 52. Jesus College and Keble College have had 46 previous scholars.

### The previous Scholars - Course, Numbers

In [106]:
scholars_by_course = scholar_data["Course"].value_counts()[:5]
create_barchart(scholars_by_course, "Courses", "Clarendon Scholars: Top Five Courses")

* From inception to 2020, the `DPhil in Clinical Medicine` course has the highest number of scholars, with `25`.


### The previous Scholars - African Descent, Numbers

In [107]:
# Extract all African countries into a list
african_countries = ["Egypt", "South Africa", "Nigeria", "Kenya", "Benin", "Sudan", "Mozambique"]

In [108]:
# Extract the african countries from all the data in the dataframe
afri_scholars = scholar_data[scholar_data["Country"].isin(african_countries)]
afri_scholars.head()

| | Country | Course | College |
|---|---|---|---|
| 3 | Egypt | DPhil in Economics | Nuffield College |
| 22 | South Africa | MSc in Surgical Science and Practice | Kellogg College |
| 132 | Nigeria | MSc in Integrated Immunology | Green Templeton College |
| 187 | South Africa | DPhil in Clinical Medicine | St John's College |
| 226 | South Africa | MSc in Neuroscience | Keble College |


In [109]:
rest_of_the_world_scholars = scholar_data.drop(afri_scholars.index)
rest_of_the_world_scholars.head()

| | Country | Course | College |
|---|---|---|---|
| 0 | United States of America | DPhil in Sociology (PT) | Nuffield College |
| 1 | India | DPhil in Computer Science | Exeter College |
| 2 | India | DPhil in International Development | Lincoln College |
| 4 | Australia | DPhil in Physiology, Anatomy and Genetics | Keble College |
| 5 | United States of America | DPhil in Geography and the Environment | Oriel College |


In [110]:
#Percentage of African scholars
percentage_of_afri_scholars = len(afri_scholars) / len(scholar_data) * 100
percentage_of_afri_scholars

3.826530612244898

In [111]:
percentage_rest_of_the_world = len(rest_of_the_world_scholars ) / len(scholar_data) * 100
percentage_rest_of_the_world

96.1734693877551
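
The two percentages can also be computed without building a second dataframe: the mean of a boolean mask is the fraction of `True` values. Here is a small sketch on a made-up four-row sample. The list of countries is the same one defined above.

```python
import pandas as pd

african_countries = ["Egypt", "South Africa", "Nigeria", "Kenya", "Benin", "Sudan", "Mozambique"]

# Hypothetical sample rows
sample = pd.DataFrame({"Country": ["Egypt", "India", "Nigeria", "United Kingdom"]})

# True where the country is in the African list, False otherwise
is_african = sample["Country"].isin(african_countries)

# The mean of a boolean Series is the share of True values
pct_african = is_african.mean() * 100
pct_rest = 100 - pct_african

print(pct_african, pct_rest)
```

On the full data this should reproduce the figures above, since 30 of the 784 rows are from the listed countries.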

In [112]:
# Plot a pie chart of the percentage of African scholars compared to the rest of the world

import plotly.graph_objects as go

labels = ["African scholars", "The rest of the world"]
values = [percentage_of_afri_scholars, percentage_rest_of_the_world]

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.show()

* Only 3.83% of the previous scholars are from Africa

In [113]:
# Number of African scholars

print(f"""
      For the period under review, {len(afri_scholars)} 
      African citizens from {len(african_countries)} countries enjoyed the prestigious Clarendon scholarship. 
      This represents {percentage_of_afri_scholars:.2f}% of all the scholars.
      """)


      For the period under review, 30 
      African citizens from 7 countries enjoyed the prestigious Clarendon scholarship. 
      This represents 3.83% of all the scholars.
      


In [114]:
# Plot the distribution of scholars across the seven African countries

by_country = afri_scholars["Country"].value_counts()
create_barchart(by_country, "African Countries", "Clarendon Scholars: African Scholars By country")

* South Africa has the highest number of previous scholars from Africa, with 21 in total.

#### Determine the number of scholarships awarded to students working towards a DPhil, a graduate degree such as an MPhil or BPhil, or a one-year degree such as an MSc, MSt, MBA or MFE

In [162]:
d6 = scholar_data.groupby(["Country","Course"])
d6.first()

| Country | Course | College |
|---|---|---|
| Albania | DPhil in Politics | St Antony's College |
| Australia | BPhil in Philosophy | Magdalen College |
| Australia | DPhil in Ancient History | Merton College |
| Australia | DPhil in Archaeological Science | St Hugh's College |
| Australia | DPhil in Archaeology | The Queen's College |
| ... | ... | ... |
| United States of America | MSt in US History | Somerville College |
| United States of America | MSt in World Literatures in English | Exeter College |
| United States of America | Master of Fine Art | Wadham College |
| United States of America | Master of Public Policy | University College |


In [153]:
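# d6 above is grouped by both Country and Course, so filter on Course directly to count one course
scholar_data[scholar_data["Course"] == "BPhil in Philosophy"].count()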

Country    3
Course     3
College    3
dtype: int64
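
Counting one course at a time does not yet answer the question in the heading. One possible approach, sketched below, is to take the first word of each course title as the degree label and map it to a broader level. The course names are taken from the tables above. The `DEGREE_LEVELS` mapping and its category labels are my own assumptions based on the heading, and anything unmapped falls into `Other`.

```python
import pandas as pd

# Course names copied from the tables above
courses = pd.Series([
    "DPhil in Sociology (PT)",
    "MSc in Neuroscience",
    "BPhil in Philosophy",
    "Master of Public Policy",
    "MSt in US History",
    "DPhil in Clinical Medicine",
])

# Assumed grouping, following the wording of the heading
DEGREE_LEVELS = {
    "DPhil": "DPhil",
    "MPhil": "MPhil/BPhil",
    "BPhil": "MPhil/BPhil",
    "MSc": "One-year degree",
    "MSt": "One-year degree",
    "MBA": "One-year degree",
    "MFE": "One-year degree",
}

# First word of the title, e.g. "DPhil in Sociology (PT)" -> "DPhil"
degree = courses.str.split().str[0]

# Map to a level; titles such as "Master of ..." are not in the mapping
level = degree.map(DEGREE_LEVELS).fillna("Other")

level_counts = level.value_counts()
print(level_counts)
```

On the real data the same steps would run on `scholar_data["Course"]`, and the `Other` bucket would show which titles still need a rule.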