# Learning Objectives:

By the end of this session, participants should be able to:

* Use pandas DataFrames for basic data wrangling
* Use SciPy/NumPy to compute descriptive statistics and run a t-test
* Create a simple visualization of data using Seaborn
* Describe how an API works
* Use BeautifulSoup to scrape data from a web page
* Use NLTK for basic text pre-processing and word counts
* Describe how to use GitHub to publish a code repository


In [None]:
!pip install nltk

In [None]:
# import the necessary packages
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup
import requests as rq 
import json
import re

## Dataframe 101 (60 minutes)
- reading from csv
- describe and info
- rename columns
- slicing data
- filtering data, group by
- counting values
- simple plots
- deriving new columns, dropping columns

In [None]:
# reading from CSV
# youth-survey.csv consists of responses from youths (15-30 years old) in the Czech Republic
data = pd.read_csv('youth-survey.csv')
data

### Retrieve basic info about the dataframe

In [None]:
# Use the DataFrame.info() method to find out more about a dataframe.
data.info()

In [None]:
# The DataFrame.columns attribute stores the dataframe's column labels.
# It has no parentheses because it is not a method but an attribute
# (a member variable) of the dataframe object.

data.columns

In [None]:
# quickly get the number of rows and columns of the dataframe

data.shape

In [None]:
# DataFrame.describe() gets the summary statistics of the columns that have numerical data. 
# All other columns are ignored, unless you use the argument include='all'.

data.describe()
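
To see what `include='all'` changes, here is a minimal sketch on a small invented dataframe (the `toy` data below is an assumption for illustration, not part of the survey file):

```python
import pandas as pd

# Toy data standing in for the survey (values are invented)
toy = pd.DataFrame({
    "age": [19, 22, 20, 22],
    "gender": ["female", "male", "female", None],
})

# Default: only the numeric column "age" is summarised
numeric_summary = toy.describe()

# include="all": text columns are summarised too (count, unique, top, freq)
full_summary = toy.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

Statistics that do not apply to a column (for example the mean of a text column) show up as NaN in the combined table.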

### Renaming columns

In [None]:
# Sometimes columns need to be renamed to make them easier to work with.
# Here we rename some columns to lowercase with no whitespace (whitespace replaced by hyphens)
# and give others a more meaningful name.

data.rename(columns = {
    "Energy levels": "energy-levels",
    "Internet usage": "internet-usage",
    "Loneliness": "loneliness",
    "left - right" : "dominant-hand",
    "Village - town": "locality"
}, inplace=True)

data.columns
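
When many columns only need the lowercase-and-hyphen treatment, typing every name into a dictionary gets tedious. A possible shortcut, sketched on an invented dataframe whose column names merely imitate the survey's style, is to apply string methods to all labels at once:

```python
import pandas as pd

# Toy dataframe; the column names imitate the style of the survey file
toy = pd.DataFrame(columns=["Energy levels", "Internet usage", "Loneliness"])

# .str applies string methods to every column label at once
toy.columns = toy.columns.str.lower().str.replace(" ", "-")

print(toy.columns.tolist())
```

This only normalises the spelling. Giving a column a more meaningful name, such as turning "left - right" into "dominant-hand", still needs an explicit mapping like the one in the cell above.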

### Selecting a subset of dataframe ("slicing")

In [None]:
# Selecting a subset ("slicing") 
# get the age of participants
data["age"]

In [None]:
# Describe just a column
data["age"].describe()

In [None]:
# get the height and age of participants
data[["height", "age"]]

In [None]:
# Describe the two columns
data[["height", "age"]].describe()

***Try Yourself:***
* Get the loneliness, happiness, and energy-levels columns

### Filtering the data to fit specified criteria

In [None]:
# Filtering: Get all data from participants above 18 years old
criteria = data["age"] > 18
data_above_18 = data[criteria]

data_above_18.head(10)

In [None]:
# Get data from Female participants above 18 years old
criteria = (data["age"] > 18) & (data["gender"] == "female")
data_female_18 = data[criteria]

data_female_18.head(10)

In [None]:
# get participants that stated their age
criteria = data["age"].notna()
data_age_known = data[criteria]

data_age_known.shape
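
Besides `&` ("and"), conditions can be combined with `|` ("or") and inverted with `~` ("not"). The sketch below uses an invented dataframe; each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparisons such as `<` or `==`:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [17, 19, 25, None],
    "gender": ["male", "female", "male", "female"],
})

# | means "or": keep rows where at least one condition holds
under_18_or_female = toy[(toy["age"] < 18) | (toy["gender"] == "female")]

# ~ means "not": keep rows where the condition does not hold
age_missing = toy[~toy["age"].notna()]

print(len(under_18_or_female), len(age_missing))
```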

***Try Yourself:***
* Get data from people whose happiness ratings are >= 3 or loneliness rating <= 3

In [None]:
# Even more granular filtering:
# get the internet-usage information of city-dwelling participants

# we can of course do it in two steps: filter the row based on the locality, and then slice the internet-usage column
# using .loc, we can filter both criteria at one go

criteria = (data["locality"] == "city")
city_dwellers_internet_usage = data.loc[criteria, "internet-usage"]

city_dwellers_internet_usage.head(10)

In [None]:
# retrieve based on index number instead of column names or row values
# retrieve the first 3 rows only

data[0:3]

In [None]:
# retrieve "lying" values (2nd column) of row 5 to 10
# use .iloc to perform this filtering+slicing in one go

data.iloc[5:11, 1]
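
The difference between `.loc` (labels) and `.iloc` (positions) is easiest to see when the row labels do not match the row positions. A small sketch with invented data:

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [20, 21, 22, 23], "height": [160, 170, 180, 190]},
    index=[10, 11, 12, 13],  # labels deliberately differ from positions
)

by_label = toy.loc[10:12, "age"]   # labels 10, 11 and 12: the end label is included
by_position = toy.iloc[0:2, 0]     # positions 0 and 1: the end position is excluded

print(by_label.tolist(), by_position.tolist())
```

This is why `data.iloc[5:11, 1]` above returns six rows (positions 5 to 10).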

***Try Yourself:***
* Get the happiness and loneliness rating of participants with more than 1 sibling

### Updating values

In [None]:
# we can also update values in the dataframe, which is especially useful for missing ones
# update the missing siblings values to 0
# assign the result back to the column so that the change is stored in the dataframe itself

data["siblings"] = data["siblings"].fillna(0)
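
Values other than missing ones can be updated by selecting the rows with a condition and assigning through `.loc`. A minimal sketch on invented data (the replacement labels "urban" and "unknown" are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"locality": ["city", "village", "city", None]})

# Select the rows with a boolean mask and the column by name, then assign
toy.loc[toy["locality"] == "city", "locality"] = "urban"

# fillna returns a new Series, so assign it back to the column
toy["locality"] = toy["locality"].fillna("unknown")

print(toy["locality"].tolist())
```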

***Try Yourself:***
* Update the missing values in gender to "No Gender"
* Update the values "left handed" to "l" and "right handed" to "r" (hint: you can use .loc for this!)

### Counting and Sorting Values

In [None]:
# Find out how many participants are female or male
data["gender"].value_counts()

In [None]:
# Find out how many female and male participants come from each locality (village or city)
data[["gender", "locality"]].value_counts()

In [None]:
# sort the age of participants from youngest to oldest
data.sort_values(by='age', inplace=True)
data.head(15)

***Try Yourself:***
* Include the NaN value when counting the number of female and male participants
* Sort participants based on happiness rating, from highest to lowest

### Creating new column based on other columns, dropping columns

In [None]:
# create a new column called "height-in-m", deriving from the "height" column
data["height-in-m"] = data["height"] / 100 # no need to use loops as dataframe will perform this automatically to all values in the column
data.head(10)

In [None]:
# drop a column
data.drop(columns=["height-in-m"], inplace=True)
# inplace=True means apply the change to the current dataframe

#check the columns now
data.columns

### Simple plots with Seaborn

In [None]:
sns.set_theme(style="ticks", color_codes=True)

# create a histogram for Energy Levels data
sns.displot(data["energy-levels"], discrete=True, shrink=.8)

In [None]:
# you can also create a categorical scatterplot (strip plot)
sns.stripplot(x="punctuality", y="age", data=data, hue="punctuality")

In [None]:
sns.lmplot(x="age", y="happiness", data=data);

In [None]:
sns.lmplot(x="age", y="happiness", hue="locality", data=data);

## Stats with Dataframe

- mean, mode, median, std, etc
- correlation

In [None]:
# Calculate the average age of the participants
data["age"].mean()

In [None]:
# Calculate the median age of the participants
data["age"].median()

In [None]:
# What's the most common age among participants?
data["age"].mode()

In [None]:
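# what's the average age of female and male participants?
# numeric_only=True skips text columns, which recent pandas versions refuse to average
grouped_data = data.groupby(by=["gender"]).mean(numeric_only=True)
grouped_data["age"]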
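
Grouping also works with more than one column, and several statistics can be computed in one call with `.agg`. A sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male"],
    "locality": ["city", "village", "city", "city", "village"],
    "age": [20, 22, 19, 21, 30],
})

# Group by two columns, then compute several statistics for "age"
summary = toy.groupby(["gender", "locality"])["age"].agg(["mean", "count"])

print(summary)
```

Selecting the column before aggregating (`["age"]`) avoids computing statistics for columns that are not needed.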

***Try Yourself:***
* Find out the average loneliness rating of participants grouped by their gender and locality

In [None]:
#### inferential stats ####
# is there any relationship between these two variables?
# let's check the value for pearson's r
pearson = data["happiness"].corr(data["age"])
pearson

In [None]:
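# to get a p-value, we can use scipy's spearmanr function
# (note: this computes Spearman's rank correlation, not Pearson's r; scipy.stats.pearsonr is the Pearson equivalent)
from scipy import stats

# the data has some NaN values; drop the rows where either value is missing
# (replacing them with 0 would add fake data points and distort the correlation)
complete = data[["happiness", "age"]].dropna()

spearman = stats.spearmanr(complete["happiness"], complete["age"])
spearman
# first value is the coefficient, and the second value is the p-value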
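
The learning objectives also mention a t-test. The sketch below, on invented data, shows one way to obtain a p-value for Pearson's r with `scipy.stats.pearsonr` and to compare the mean age of two groups with `scipy.stats.ttest_ind`. Note that `ttest_ind` assumes equal variances in both groups unless `equal_var=False` is passed:

```python
import pandas as pd
from scipy import stats

toy = pd.DataFrame({
    "gender": ["female", "male"] * 4,
    "age": [18, 21, 19, 23, 20, 24, 22, 25],
    "happiness": [4, 3, 5, 3, 4, 2, 4, 3],
})

# Pearson's r together with its p-value
r, r_pvalue = stats.pearsonr(toy["age"], toy["happiness"])

# Independent two-sample t-test: do the two groups differ in mean age?
female_age = toy.loc[toy["gender"] == "female", "age"]
male_age = toy.loc[toy["gender"] == "male", "age"]
t_stat, t_pvalue = stats.ttest_ind(female_age, male_age)

print(f"r={r:.3f} (p={r_pvalue:.3f}), t={t_stat:.3f} (p={t_pvalue:.3f})")
```

With only eight invented rows, the numbers themselves mean nothing; the point is the shape of the calls and of the results.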

In [None]:
# we can also quickly calculate the correlation coefficients between all numerical variables
# and keep them in a matrix (numeric_only=True leaves out the text columns)
corr_matrix = data.corr(numeric_only=True)
corr_matrix

In [None]:
# show the matrix in a heatmap using seaborn
sns.heatmap(corr_matrix)

## API and Web scraping (50 Mins + 5 mins break)

* API calls in Python (20 Minutes)
  * URL for CORE API: https://api.core.ac.uk/v3/search/works/?q=singapore+language.code:en+yearPublished%3E2015
* Simple web-scraping (30 Minutes)

### Use CORE API to retrieve articles in CORE about Singapore, published after 2015.

* brief lecture on anatomy of API --> e.g. for retrieving a single paper using its DOI
* how to read API documentation

In [None]:
# Read the API key and prepare the API call
with open("apikey.txt") as f:
    api_key = f.read().strip()  # strip the trailing newline that text files usually end with
    
api_call = "https://api.core.ac.uk/v3/search/works/?q=singapore+language.code:en+yearPublished%3E2015&api_key=" + api_key

In [None]:
# call the API and take a peek at the response in JSON format
core_response = rq.get(api_call)

# check the status code
core_response.status_code
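
Instead of gluing the query string together by hand, `requests` can encode it from a dictionary. The sketch below builds the request without sending it, so the resulting URL can be inspected offline. The parameter name `api_key` is copied from the cell above, and `YOUR_API_KEY` is a placeholder:

```python
import requests as rq

base_url = "https://api.core.ac.uk/v3/search/works/"
params = {
    "q": "singapore language.code:en yearPublished>2015",
    "api_key": "YOUR_API_KEY",  # placeholder, not a real key
}

# Build the request without sending it, to inspect the encoded URL
prepared = rq.Request("GET", base_url, params=params).prepare()
print(prepared.url)

# To actually call the API: response = rq.get(base_url, params=params, timeout=30)
```

Special characters such as spaces and `>` are percent-encoded automatically, which is what the `+` and `%3E` in the hand-written URL stand for.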

In [None]:
# Parse the response body, which is a string of JSON text.
response_json = core_response.json()

# Python stores the result in a dictionary, which can then be loaded into a dataframe.
# Note that if the dictionary is deeply nested, you may need to flatten it first.
response_json
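
One way to flatten a nested dictionary into a dataframe is `pd.json_normalize`. The response below is invented for illustration; the real CORE response may use different keys and nesting:

```python
import pandas as pd

# Invented stand-in for an API response; the real CORE response may be structured differently
toy_response = {
    "totalHits": 2,
    "results": [
        {"title": "Paper A", "yearPublished": 2016, "language": {"code": "en"}},
        {"title": "Paper B", "yearPublished": 2017, "language": {"code": "en"}},
    ],
}

# json_normalize flattens nested dictionaries into dotted column names
articles = pd.json_normalize(toy_response["results"])

print(articles.columns.tolist())
```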

### Extracting fulltext from a single HTML page

Link to article used in this notebook: https://crl.acrl.org/index.php/crl/article/view/24753/32576

In [None]:
url = "https://crl.acrl.org/index.php/crl/article/view/24753/32576"
markup = rq.get(url).text

In [None]:
# parse content
soup = BeautifulSoup(markup, 'lxml') # with lxml parser
# soup = BeautifulSoup(markup, 'html.parser') # with Python's built-in html.parser

In [None]:
# full text content is located inside a "div" element inside div#content. 
# it's always placed as the first child and the only div without class
# the CSS selector below will only work for ACRL full-text article pages.
content = soup.select("div#content > div:not(.block)")
print(content)

In [None]:
# go through the tags contained inside the retrieved div and apply the checks, etc.
# print out the result
fulltext = ""
if content:
    for div in content:
        for tag in div:
            fulltext += tag.text + "\n" #save everything inside this div regardless of tag type
            if tag.name == "h1":
                print("Title: ", tag.text)
            elif tag.name == "p" or tag.name == "ul":
                print("Text:", tag.text)
            elif tag.name == "h2" or tag.name == "h3": 
                print("Headings:", tag.text)
            elif tag.name == "div":
                print("Tables:", tag.text)
else:
    print("content is empty; please check the HTML tags for soup.select")
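
To experiment with the CSS selector without fetching the page, the same steps can be tried on a small piece of HTML. The markup below is invented and only imitates the layout described above:

```python
from bs4 import BeautifulSoup

# Invented HTML that imitates the layout described above
toy_markup = """
<div id="content">
  <div><h1>Sample title</h1><p>First paragraph.</p></div>
  <div class="block">Sidebar</div>
</div>
"""

soup = BeautifulSoup(toy_markup, "html.parser")

# Direct children of div#content that are divs without the class "block"
content = soup.select("div#content > div:not(.block)")

title = content[0].find("h1").text
paragraphs = [p.text for p in content[0].find_all("p")]
print(title, paragraphs)
```

Using `find` and `find_all` returns tags only, whereas looping over a div directly also yields the text between tags, for which `tag.name` is `None`.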

In [None]:
# save to file
# the with-statement closes the file automatically, even if writing fails
with open("crl_fulltext.txt", "w", encoding="utf-8") as txt_file:
    txt_file.write(fulltext)

Brief lecture on:
* advantages of using an API over web scraping
* available APIs to use (show API guide)

## Text analysis 101 (30 mins)

- The pre-processing
- Descriptive text analysis: WordCount

In [None]:
# load the JSON data, containing approx 200 articles from CORE
core_df = pd.read_json("core_articles.json")
core_df.head(5)

In [None]:
# We are not interested in all the columns, so let's get only the necessary columns
metadata_df = core_df[['createdDate', 'title', 'abstract', 'fullText', 'yearPublished']]
metadata_df

In [None]:
# import the necessary packages
import nltk
from nltk import word_tokenize #tokenizer
from nltk.corpus import stopwords #stopwords
from nltk.stem import WordNetLemmatizer #lemmatizer
from nltk.probability import FreqDist #to count words

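nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from 'punkt_tab' instead of 'punkt'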

**Goal: create a word cloud of the corpus to get an overview of the theme. Let's use the abstract for this exercise.**

Preprocessing steps:
* remove punctuation (remove non-alphanumeric characters)
* convert to lowercase
* tokenize the words
* lemmatize the words
* remove stopwords
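
Before going through the steps with NLTK, here is a dependency-free sketch of the same pipeline on a single sentence. It is a simplification: it splits on whitespace instead of using NLTK's tokenizer, uses a tiny hand-written stopword list (`TOY_STOPWORDS` is made up for this example), and skips lemmatization entirely:

```python
import re

# Tiny hand-written stopword list, for illustration only
TOY_STOPWORDS = {"the", "of", "in", "a", "is", "and"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Clean one text and return its remaining words as a list."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)   # 1. remove non-alphanumeric characters
    text = text.lower()                          # 2. convert to lowercase
    tokens = text.split()                        # 3. tokenize (on whitespace, as a stand-in)
    # 4. lemmatization is skipped in this sketch
    return [t for t in tokens if t not in stopwords and t.isalpha()]  # 5. remove stopwords

print(preprocess("The Languages of Singapore, in 2016!"))
```

Because lemmatization is skipped, plural forms such as "languages" are kept as they are; the NLTK cells below handle that step.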

In [None]:
all_abstract = "" # empty string to hold all of the abstracts in one variable

# combine all the abstract contents into one string variable
# while combining, do step #1 remove non-alphanumeric characters
for abstract in metadata_df["abstract"]:
    if pd.notna(abstract) and abstract:
        temp_string = re.sub(r'[^a-zA-Z0-9]', ' ', str(abstract)) # replace non-alphanumeric characters with spaces
        all_abstract += temp_string + " "  # append the abstract, plus a space so that neighbouring abstracts do not merge
        
all_abstract

In [None]:
# Step #2 convert to lowercase
all_abstract = all_abstract.lower()
all_abstract

In [None]:
# Step #3 tokenize the words with NLTK's word_tokenize
tokenized_abstract = word_tokenize(all_abstract)
tokenized_abstract

In [None]:
# Step #4 lemmatize the words and Step #5 remove stopwords

# prepare an empty list to hold all the filtered words
filtered_abstract = []

#initiate the lemmatizer
wnl = WordNetLemmatizer()

#initiate list of stopwords
stop_words = set(stopwords.words("english"))
#update stopwords to also include singapore
stop_words.update(["singapore"])

for word in tokenized_abstract:
    #lemmatize the word to its dictionary form
    word = wnl.lemmatize(word)
    
    #check if it's part of the stop_words
    if word not in stop_words:
        #extra checks if it's alphabetical and not a numeric word
        if word.isalpha():
            #if yes, add this word to the list of filtered_abstract
            filtered_abstract.append(word)
            
filtered_abstract

### Visualizing the result through frequency list or word cloud

In [None]:
# Get the exact number of occurrences of each word using NLTK's FreqDist
fdist = FreqDist(filtered_abstract)
print("Top 50 most used words in the abstract: \n")

fdist.most_common(50)
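
In NLTK 3, `FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be tried out with the standard library alone. A sketch on an invented word list:

```python
from collections import Counter

toy_words = ["language", "policy", "language", "school", "policy", "language"]

counts = Counter(toy_words)

# most_common(n) returns (word, count) pairs, highest count first
top_two = counts.most_common(2)
print(top_two)
```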

In [None]:
# IF ON WINDOWS: download the following pre-built "wheel" file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud
# and install it with the following pip command

!pip install wordcloud-1.8.1-cp39-cp39-win_amd64.whl

In [None]:
!pip install wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# join the list of words into a single string
text = " ".join(filtered_abstract)

word_cloud = WordCloud(background_color = 'white', max_words=50, width=1200, height=800)
word_cloud.generate(text)

plt.figure()
plt.imshow(word_cloud)
plt.axis("off")
plt.show()

Based on the result, you may need to adjust the stopwords to get more meaningful word clouds.

***Try Yourself:***
* Do the pre-processing on the full text instead of abstract. Do you get a different list of words?
* Create a word cloud of papers published in 2016 vs 2017. Are there any differences?