<a id='sources'></a>
# Sources

While writing this assignment, I used many online resources. To provide an accurate and complete list of sources, I wrote a Python script that generates the list from an extract of my browser history.

In [14]:
# This script opens data/history.csv and generates a sources list

# history.csv was generated from my Chrome browser history for the period over which I did the assignment
# Note: I pruned out personal and unrelated sources
# This pruning was quick because we tend to have a daily pattern of when and what we access online,
# so I could remove non-assignment links in a spreadsheet.

import pandas as pd

# clean up the df

df = pd.read_csv(r'./data/history.csv', skiprows=0, index_col=0)
df.columns = ["order", "date","time","title","url","visitCount","typedCount","transition"]
df.drop(["order", "typedCount", "transition"], axis=1, inplace=True)

# the history.csv file was created using Google Sheets
# Google Sheets will only export CSV with commas
# Therefore I had to replace 290 commas with REPLACE-THIS-PLEASE before exporting to CSV
# I fix that now...

df.replace({'REPLACE-THIS-PLEASE': ','}, inplace=True, regex=True)

df.head()

order,date,time,title,url,visitCount
1,4/15/2021,11:27:44,python - Simple way to measure cell execution ...,https://stackoverflow.com/questions/32565829/s...,2
2,4/15/2021,11:22:53,sklearn.ensemble.AdaBoostClassifier — scikit-l...,https://scikit-learn.org/stable/modules/genera...,3
3,4/15/2021,11:05:08,python - UndefinedMetricWarning: F-score is il...,https://stackoverflow.com/questions/43162506/u...,1
4,4/15/2021,10:57:21,python - How is scikit-learn cross_val_predict...,https://stackoverflow.com/questions/41458834/h...,1
5,4/15/2021,10:57:04,Python Examples of sklearn.cross_validation.cr...,https://www.programcreek.com/python/example/91...,1
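
The placeholder round trip used above can be tried on a small frame. This is a minimal sketch: the `sample` data is made up for illustration and is not taken from history.csv.

```python
import pandas as pd

# Hypothetical sample: one title whose commas were swapped for the placeholder
sample = pd.DataFrame({
    "title": ["pandas REPLACE-THIS-PLEASE numpy REPLACE-THIS-PLEASE scipy", "plain title"],
    "visitCount": [2, 1],
})

# regex=True makes replace() substitute the placeholder inside longer strings;
# without it, only cells that equal the placeholder exactly would change
restored = sample.replace({"REPLACE-THIS-PLEASE": ","}, regex=True)

print(restored["title"].tolist())
```

The numeric `visitCount` column is left untouched, because the substitution only applies to string values.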


In [15]:
# There are 2294 sources!!

df.shape

# Reduce the list...

(2294, 5)

In [16]:
# take out the duplicates

df = df.drop_duplicates(subset=["url"], keep='first')
df = df.drop_duplicates(subset=["title"], keep='first')

In [17]:
df.shape

(1185, 5)
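
A small sketch of how the two de-duplication passes behave, using a made-up `visits` frame (the titles and example.com URLs are assumptions for illustration). Running the passes one after the other removes a row if either its url or its title has already been seen, which is stricter than a single pass over both columns together.

```python
import pandas as pd

# Hypothetical history extract with repeated urls and repeated titles
visits = pd.DataFrame({
    "title": ["Page A", "Page A", "Page B", "Page C"],
    "url": [
        "https://example.com/a",
        "https://example.com/a?ref=1",
        "https://example.com/b",
        "https://example.com/b",
    ],
})

# Pass 1: keep the first row for each url
by_url = visits.drop_duplicates(subset=["url"], keep="first")

# Pass 2: keep the first row for each title among the survivors
by_both = by_url.drop_duplicates(subset=["title"], keep="first")

# For comparison: one pass that only drops rows where url AND title both repeat
by_pair = visits.drop_duplicates(subset=["url", "title"], keep="first")

print(len(visits), len(by_url), len(by_both), len(by_pair))
```

In this sample, "Page C" is dropped in the first pass because it shares a url with "Page B", even though its title is unique.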

In [18]:
# convert the column to numeric so I can query on it

df['visitCount'] = pd.to_numeric(df['visitCount'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1185 entries, 1 to 2281
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        1185 non-null   object
 1   time        1185 non-null   object
 2   title       1184 non-null   object
 3   url         1185 non-null   object
 4   visitCount  1185 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 55.5+ KB


In [19]:
# Due to the large number of pages / sources, I will only include sources that I visited more than once.
# Visit count is a reasonable indicator of whether or not a page was used:
# most of the time, if I visit a page just once, it was not very useful for my purposes.

# I could reduce this further: visitCount > 2 leaves 103 sources, visitCount > 3 leaves 52

df = df.loc[df['visitCount'] > 1]
df.shape

(290, 5)
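
One way to compare candidate cut-offs before committing to one is to count the surviving rows for each threshold. This is a sketch on made-up visit counts; `rows_above` is a hypothetical helper, not part of the script above.

```python
import pandas as pd

# Hypothetical visit counts standing in for the real visitCount column
sample = pd.DataFrame({"visitCount": [1, 1, 2, 3, 3, 5]})


def rows_above(frame, threshold):
    """Return how many rows have a visitCount strictly greater than threshold."""
    return int((frame["visitCount"] > threshold).sum())


# Count the rows that would remain for each candidate threshold
counts = {threshold: rows_above(sample, threshold) for threshold in (1, 2, 3)}
print(counts)
```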

In [20]:
# Sort the list by the title

df = df.sort_values(by=["title"])

In [None]:
# This is the most effective method for this source list
# Typically in an academic paper the author is the sorting column, then the year
# For these types of online sources, those details are not readily available

# I did explore querying meta-databases (Google Archive, Archive.com, Wayback Machine)
# and parsing meta keywords out of the HTML to get this data about the pages,
# but this is beyond the scope of this source list

# PS. I might develop this capability into a script as a future personal GitHub project
# because reference lists consume huge amounts of time on academic assignments
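
As a rough illustration of the metadata idea mentioned above, the standard library can pull `<meta>` tags out of a page's HTML. This is only a sketch: `MetaTagParser`, `extract_meta`, and the sample HTML are hypothetical, no pages are downloaded, and real pages vary widely in which tags they provide.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects name/property and content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Pages use either name="..." or property="..." to label a meta tag
        name = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if name and content:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of the meta tags found in html_text."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.meta


# Made-up page source used only to exercise the parser
sample_html = """<html><head>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2021-04-15">
</head><body></body></html>"""

print(extract_meta(sample_html))
```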

In [21]:
# The list is output in order of page title
# Each entry records the access date, url, title, and web site (domain)

# These are not strictly academic references, but they do acknowledge / cite the sources used

from datetime import datetime
from html import escape
from urllib.parse import urlparse

from IPython.display import display, HTML


for index, row in df.iterrows():
    url = str(row['url'])

    # escape the url and title so special characters cannot break the generated HTML
    link = "Available at: <a href='" + escape(url) + "'>" + escape(url) + "</a>"

    # the date column holds month/day/year text, e.g. 4/15/2021
    accessDate = datetime.strptime(row['date'], '%m/%d/%Y')

    # the web site (domain) is the network location part of the url
    source = urlparse(url).netloc

    # the title is wrapped in bold once, here
    title = "<b>" + escape(str(row['title'])) + "</b><br>"

    display(HTML(title))
    display(HTML("<i>" + escape(source) + "</i>"))
    display(HTML(link))
    display(HTML("<p>Accessed on: " + accessDate.strftime("%d %B %Y") + "</p><br><br>"))
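
The per-row formatting can also be collected into one function that returns the HTML for a single entry, which makes it easy to check outside the notebook. This is a minimal sketch: `format_reference` is a hypothetical helper, and the title and url below are made up.

```python
from datetime import datetime
from html import escape
from urllib.parse import urlparse


def format_reference(title, url, visit_date):
    """Build the HTML for one source entry from a title, url, and m/d/Y date."""
    accessed = datetime.strptime(visit_date, "%m/%d/%Y").strftime("%d %B %Y")
    domain = urlparse(url).netloc
    safe_url = escape(url)
    return (
        f"<b>{escape(title)}</b><br>"
        f"<i>{escape(domain)}</i><br>"
        f"Available at: <a href='{safe_url}'>{safe_url}</a>"
        f"<p>Accessed on: {accessed}</p>"
    )


# Made-up entry used only to show the output shape
entry = format_reference(
    "Tips & tricks", "https://stackoverflow.com/questions/1", "4/15/2021"
)
print(entry)
```

Returning a string rather than calling `display` directly keeps the formatting separate from the notebook, so the same function could write the list to a file.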


**Note**

I reviewed my source list, and this script was an effective way of generating it. I am confident it is a very good representation of the sources that I referred to.