# CIS 545 - Big Data Analytics - Fall 2019

# Homework 1: Data Wrangling and Cleaning
# Due Date: September 25, 2019 at 10pm

We all know that cryptocurrencies are all the rage today.  Could we train an algorithm to tell the difference between a webpage about cryptocurrency and a webpage about something else?

This initial assignment goes over some of the basic steps in (1) acquiring data from the web, (2) acquiring tabular data, (3) cleaning and linking data, and (4) training a simple machine learning classifier.  Along the way you'll learn a few of the basic tools, and get a very basic understanding of one way to represent documents.

**Note: You do not need to connect your local runtime to do this assignment!**

In [3]:
# Standard pip install...  Put all of your to-install packages here.
# Depending on your configuration, you may need to change pip3 to pip
!pip3 install scrapy
!pip3 install lxml
!pip3 install scikit-learn
!pip3 install swifter

Collecting scrapy
  Downloading https://files.pythonhosted.org/packages/29/4b/585e8e111ffb01466c59281f34febb13ad1a95d7fb3919fd57c33fc732a5/Scrapy-1.7.3-py2.py3-none-any.whl (234kB)
Collecting Twisted>=13.1.0; python_version != "3.4" (from scrapy)
  Downloading https://files.pythonhosted.org/packages/14/49/eb654da38b15285d1f594933eefff36ce03106356197dba28ee8f5721a79/Twisted-19.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.1MB)
Collecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz
Collecting parsel>=1.5 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/86/c8/fc5a2f9376066905dfcca334da2a25842aedfda142c0424722e7c497798b/parsel-1.5.2-py2.py3-none-any.whl
Collecting service-identity (from scrapy)
  Downloa

In [4]:
# Standard imports; it's cleaner to put them here so they can be used
# throughout the notebook

import pandas as pd
import numpy as np
from lxml import etree
import sqlite3
import swifter
import urllib.request  # import the submodule explicitly so urllib.request.urlopen is available
import re

import nltk
from nltk import classify
from nltk import NaiveBayesClassifier
from nltk.stem import PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


## Task 1: Acquiring data for training our system

First let's get some information about what's a cryptocurrency.  For that -- there's always [Wikipedia](https://en.wikipedia.org/wiki/List_of_cryptocurrencies)!

But of course it won't give us the data exactly the way we want it, so we'll need to do a bit of information extraction and data wrangling. We will also try to get current price levels from [Yahoo](https://finance.yahoo.com/cryptocurrencies).

### Task 1.1: Fetch the list of pages from Wikipedia and put it into a dataframe

First we'll get the master table of "known" cryptocurrencies. Use the `read_html()` function from `pandas`. 

In [5]:
# TODO:
# (1) Fetch files from Wikipedia:  https://en.wikipedia.org/wiki/List_of_cryptocurrencies
# (2) Parse into a dataframe called cryptocurrency_df
# YOUR CODE HERE
cryptocurrency_df= pd.read_html('https://en.wikipedia.org/wiki/List_of_cryptocurrencies')[0]
display(cryptocurrency_df)
#raise NotImplementedError()

Unnamed: 0,Release,Currency,Symbol,Founder(s),Hash algorithm,Programming language of implementation,"Cryptocurrency blockchain (PoS, PoW, or other)",Notes
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS."
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...


Next, do the same for the following two URLs. Yahoo returns at most 100 prices per request, which is why we need two queries.
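The two URLs differ only in their `offset` query parameter, so they can be generated instead of typed out. The following is a minimal sketch, assuming the `count`/`offset` scheme in the URLs given for this task; `BASE_URL`, `PAGE_SIZE`, and `price_urls` are names made up for illustration.

```python
# Hypothetical helper names; only the URL pattern comes from the assignment.
BASE_URL = "https://finance.yahoo.com/cryptocurrencies/"
PAGE_SIZE = 100  # at most 100 prices per request

# One URL per page of results: offsets 0 and 100.
price_urls = [
    "%s?count=%d&offset=%d" % (BASE_URL, PAGE_SIZE, offset)
    for offset in range(0, 2 * PAGE_SIZE, PAGE_SIZE)
]

for url in price_urls:
    print(url)
```

Each URL could then be passed to `pd.read_html` in turn, which keeps the page size in one place if it ever needs to change.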

In [6]:
# TODO: Make two price dataframes from
# price_1_df: https://finance.yahoo.com/cryptocurrencies/?count=100&offset=0
# price_2_df: https://finance.yahoo.com/cryptocurrencies/?count=100&offset=100

# YOUR CODE HERE
#raise NotImplementedError()
price_1_df= pd.read_html('https://finance.yahoo.com/cryptocurrencies/?count=100&offset=0')[0]
price_2_df= pd.read_html('https://finance.yahoo.com/cryptocurrencies/?count=100&offset=100')[0]
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two pages the same way
price_df = pd.concat([price_1_df, price_2_df])

display(price_df)

Unnamed: 0,Symbol,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,BTC-USD,Bitcoin USD,9732.130000,37.53000,+0.39%,174.729B,56.445M,342.15M,2.907B,17.954M,,
1,XRP-USD,Ripple USD,0.270400,0.00290,+1.08%,27.038B,7.729M,40.357M,526.05M,99.992B,,
2,ETH-USD,Ethereum USD,201.060000,0.09000,+0.04%,21.688B,9.03M,65.452M,1.284B,107.869M,,
3,BCH-USD,Bitcoin Cash / BCC USD,291.850000,-0.30000,-0.10%,5.259B,7.545M,33.273M,281.568M,18.02M,,
4,LTC-USD,Litecoin USD,66.820000,0.35000,+0.53%,4.212B,5.367M,37.968M,730.148M,63.039M,,
5,USDT-USD,Tether USD,1.001000,0.00000,0.00%,4.112B,650985,5.215M,49.139M,4.108B,,
6,BNB-USD,Binance Coin USD,19.340000,-0.04000,-0.20%,3.008B,5.258M,17.742M,46.504M,155.537M,,
7,EOS-USD,EOS USD,3.687000,-0.00400,-0.11%,2.408B,1.067M,5.221M,532.305M,653.096M,,
8,LINK-USD,ChainLink USD,1.828000,0.02600,+1.44%,1.828B,655281,3.964M,85.202M,1B,,
9,XLM-USD,Stellar USD,0.064000,0.00050,+0.82%,1.286B,429567,2.439M,37.359M,20.085B,,


In [7]:
# Quick sanity check 1.1 for cryptocurrency_df: does it have the columns from the Wikipedia table?

if not 'Currency' in cryptocurrency_df:
    raise AssertionError('Expected column called "Currency"')
    
if not 'Founder(s)' in cryptocurrency_df:
    raise AssertionError('Expected column called "Founder(s)"')

display(cryptocurrency_df)

Unnamed: 0,Release,Currency,Symbol,Founder(s),Hash algorithm,Programming language of implementation,"Cryptocurrency blockchain (PoS, PoW, or other)",Notes
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS."
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...


In [0]:
# Hidden tests 1.1 for autograding cryptocurrency_df - don't delete this cell please!


### Task 1.2 First bit of data Cleaning:  Clean up the schema names.

It turns out that SQL databases often don't like parentheses and spaces in the column names.  Change the column names for the appropriate columns, by 

1. removing the parts in parentheses
2. trimming any blank spaces before or after the names
3. inserting underscores for spaces.  

Hint: there are functions called `trim`, `strip`, `find`, `replace`.
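The three steps can be tried on a single string before applying them to every column. This is only a sketch of one possible approach: `clean_column_name` is a hypothetical helper, and the sample names are taken from the Wikipedia table above.

```python
import re


def clean_column_name(name):
    """Apply the three cleanup steps to a single column name."""
    name = re.sub(r"\([^)]*\)", "", name)  # 1. drop parenthesized parts
    name = name.strip()                     # 2. trim surrounding whitespace
    return name.replace(" ", "_")           # 3. remaining spaces become underscores


examples = [
    "Founder(s)",
    "Hash algorithm",
    "Cryptocurrency blockchain (PoS, PoW, or other)",
]
for column in examples:
    print(column, "->", clean_column_name(column))
```

The order matters: removing the parentheses first can leave a trailing space, which the `strip` step then cleans up before underscores are inserted.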

In [8]:
# TODO:
# For all column names in cryptocurrency_df, 
# (1) remove anything in parentheses, 
# (2) remove leading and trailing spaces, 
# (3) replace remaining spaces with underscores

# YOUR CODE HERE

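# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
cryptocurrency_df.columns = cryptocurrency_df.columns.str.replace(r"\(.*\)", "", regex=True).str.strip().str.replace(' ', '_')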
'''
cols=list(cryptocurrency_df.columns) 
for i in range(len(cols)):
  cols[i]=re.sub(r'\([^)]*\)', '', cols[i])
  cols[i]=cols[i].strip()
  cols[i]=cols[i].replace(' ', '_')
cryptocurrency_df.columns=cols 
'''
display(cryptocurrency_df)
    
  
  

Unnamed: 0,Release,Currency,Symbol,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS."
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...


In [9]:
# Sanity check 1.2 for cryptocurrency_df

for column in cryptocurrency_df.keys():
    if column.find(' ') >= 0:
        raise AssertionError('Forgot to remove a space in "%s"'%column)
    elif column.find('(') >= 0 or column.find(')') >= 0:
        raise AssertionError('Forgot to remove a paren in %s'%column)
        
display(cryptocurrency_df)

Unnamed: 0,Release,Currency,Symbol,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS."
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...


In [0]:
# Hidden tests 1.2 for autograding cryptocurrency_df - please don't delete


### Task 1.3: Joining the tables

We are now going to try to put these two sources of information into one table. The requirement is that we want to make sure that we have an entry for every currency in the Wikipedia list, but not necessarily for every currency in the Yahoo price list. Of the four types of join, two can achieve this requirement. For extra practice, see if you can figure out both correct answers.
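To see how the join type decides which rows survive, it can help to experiment on tiny made-up tables first. The sketch below uses toy data (not the real Wikipedia or Yahoo tables) and shows two ways of keeping every row of the "Wikipedia" side; the frame names `wiki` and `prices` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two real tables.
wiki = pd.DataFrame({"Currency": ["Bitcoin", "Namecoin", "Dogecoin"]})
prices = pd.DataFrame({
    "Name": ["Bitcoin", "Dogecoin", "Tether"],
    "Price": [9732.13, 0.0024, 1.001],
})

# Keep every row of the frame on the left of the merge call.
left_join = wiki.merge(prices, left_on="Currency", right_on="Name", how="left")

# Same requirement with the frames swapped: keep every row of the right frame.
right_join = prices.merge(wiki, left_on="Name", right_on="Currency", how="right")

print(left_join)
print(right_join)
```

In both results Namecoin appears with missing price information, while Tether, which exists only in the price table, is dropped.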

#### Task 1.3.1 Attempt #1

In the cell below, join `cryptocurrency_df` and `price_df` using "Name" as the join index of `price_df` and "Currency" as the join index of `cryptocurrency_df`. The result should be named `joined_on_name_df`. Do not make any changes to the data frames yet, even though you may see a problem with joining them now.

In [10]:
# TODO: Join cryptocurrency_df and price_df

# YOUR CODE HERE
joined_on_name_df=cryptocurrency_df.merge(price_df, left_on=['Currency'], right_on=['Name'],how='left')
display(joined_on_name_df)

Unnamed: 0,Release,Currency,Symbol_x,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes,Symbol_y,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...,,,,,,,,,,,,
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...,,,,,,,,,,,,
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS.",,,,,,,,,,,,
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...,,,,,,,,,,,,
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.,,,,,,,,,,,,
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...,,,,,,,,,,,,
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...,,,,,,,,,,,,
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...,,,,,,,,,,,,
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...,,,,,,,,,,,,
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...,,,,,,,,,,,,


In [0]:
# Sanity check 1.3.1 for joined_on_name_df

if len(joined_on_name_df.columns) != 20:
    raise AssertionError('Your joined table has %d columns, an unexpected number.'%len(joined_on_name_df.columns))

In [0]:
# Hidden tests 1.3.1 for autograding joined_on_name_df - please don't delete


#### Task 1.3.2 Cleaning up the names

You may have noticed a mismatch for how the currencies are named between the two data frames. Use the `apply` function to replace the values in the `price_df["Name"]` column so they better match the values in `cryptocurrency_df["Currency"]`.

Then rerun your join from 1.3.1 and name it the same way.
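One thing to watch for when stripping the currency suffix is that a plain string replacement removes every occurrence of the text, not just the trailing one. The sketch below uses made-up sample names (not the live Yahoo data) and anchors the pattern to the end of the string.

```python
import re

import pandas as pd

# Made-up sample; the second entry shows why anchoring to the end matters.
names = pd.Series(["Bitcoin USD", "USD Coin USD", "Tether USD"])

# Remove only a trailing "USD" (and the whitespace before it).
cleaned = names.apply(lambda name: re.sub(r"\s*USD$", "", name).strip())

print(cleaned.tolist())
```

With an unanchored replacement, the second name would lose both occurrences of `USD` and no longer resemble the currency's name.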

In [12]:
# TODO: Fix the Name column in price_df and redo the join

# YOUR CODE HERE
price_df['Name'] = price_df['Name'].apply(lambda x: x.replace('USD', ''))
price_df['Name'] = price_df['Name'].apply(lambda x: x.strip())
joined_on_name_df=cryptocurrency_df.merge(price_df, left_on=['Currency'], right_on=['Name'],how='left')
display(joined_on_name_df)

Unnamed: 0,Release,Currency,Symbol_x,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes,Symbol_y,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...,BTC-USD,Bitcoin,9732.13,37.53,+0.39%,174.729B,56.445M,342.15M,2.907B,17.954M,,
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...,LTC-USD,Litecoin,66.82,0.35,+0.53%,4.212B,5.367M,37.968M,730.148M,63.039M,,
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS.",,,,,,,,,,,,
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...,,,,,,,,,,,,
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.,DOGE-USD,Dogecoin,0.0024,0.0,0.00%,295.196M,445887,2.088M,8.234M,121.329B,,
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...,,,,,,,,,,,,
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...,,,,,,,,,,,,
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...,,,,,,,,,,,,
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...,NXT-USD,Nxt,0.0156,-0.0001,-0.62%,15.571M,79677,194528,294371,1B,,
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...,,,,,,,,,,,,


In [0]:
# Sanity check 1.3.2 for joined_on_name_df

if len(joined_on_name_df[joined_on_name_df["Name"].notna()]) == 0:
    raise AssertionError('Your join did not find any matches. Maybe you did something wrong?')

In [0]:
# Hidden tests 1.3.2 for autograding joined_on_name_df cleaned - please don't delete


#### Task 1.3.3: Clean the citations out of the content.

As we saw in lecture, the html processing function converts Wikipedia citations to normal text. You may have noticed that this is keeping at least one of the cryptocurrencies from matching during the join. In the cell below, use `applymap` to remove these citations from the entire `cryptocurrency_df` table. Assume that every instance of "`[`" begins a citation. In this case only, it is okay if you delete everything after the "`[`", including the stuff after "`]`".

Then rerun your join from 1.3.2 and name it the same way. Did you get more matches?
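The simplification allowed above (deleting everything after the first "`[`") behaves differently from removing only the bracketed citations. The following sketch compares the two on values from the table; both helper names are hypothetical, and either could be passed to `applymap`.

```python
import re


def drop_after_bracket(value):
    """Keep only the text before the first '[' (the simplification allowed here)."""
    return str(value).split("[")[0].strip()


def drop_bracketed_only(value):
    """Remove each [...] group but keep the text around it."""
    return re.sub(r"\[[^\]]*\]", "", str(value)).strip()


for cell in ["Ripple[23][24]", "Satoshi Nakamoto[nt 1]", "BTC,[4] XBT, ₿"]:
    print(repr(cell), "->", repr(drop_after_bracket(cell)), "|", repr(drop_bracketed_only(cell)))
```

For the currency names both give the same result, but for the Bitcoin symbol cell the first approach leaves `'BTC,'` while the second keeps the other symbols. Note that `str(value)` also turns missing values into the string `'nan'`, which may or may not be what you want.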

In [14]:
# TODO: Remove citations

# YOUR CODE HERE
#res=re.sub(r'\[(.*?)\]', '', 'SHA-256d[5][6]')
#print(res)
#cryptocurrency_df=cryptocurrency_df.applymap(lambda x:re.sub(r'\[(.*?)\]', '', str(x)))
cryptocurrency_df=cryptocurrency_df.applymap(lambda x:str(x).split('[')[0].strip())
joined_on_name_df=cryptocurrency_df.merge(price_df, left_on=['Currency'], right_on=['Name'],how='left')
display(joined_on_name_df)


Unnamed: 0,Release,Currency,Symbol_x,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes,Symbol_y,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,2009,Bitcoin,"BTC,",Satoshi Nakamoto,SHA-256d,C++,PoW,The first and most widely used decentralized l...,BTC-USD,Bitcoin,9732.13,37.53,+0.39%,174.729B,56.445M,342.15M,2.907B,17.954M,,
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++,PoW,One of the first cryptocurrencies to use Scryp...,LTC-USD,Litecoin,66.82,0.35,+0.53%,4.212B,5.367M,37.968M,730.148M,63.039M,,
2,2011,Namecoin,NMC,Vincent Durham,SHA-256d,C++,PoW,"Also acts as an alternative, decentralized DNS.",,,,,,,,,,,,
3,2012,Peercoin,PPC,Sunny King(pseudonym),SHA-256d,C++,PoW & PoS,The first cryptocurrency to use POW and POS fu...,,,,,,,,,,,,
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus,Scrypt,C++,PoW,Based on the Doge internet meme.,DOGE-USD,Dogecoin,0.0024,0.0,0.00%,295.196M,445887,2.088M,8.234M,121.329B,,
5,2013,Gridcoin,GRC,Rob Hälford,Scrypt,C++,Decentralized PoS,Linked to citizen science through the Berkeley...,,,,,,,,,,,,
6,2013,Primecoin,XPM,Sunny King(pseudonym),1CC/2CC/TWN,"TypeScript, C++",PoW,Uses the finding of prime chains composed of C...,,,,,,,,,,,,
7,2013,Ripple,XRP,Chris Larsen &Jed McCaleb,ECDSA,C++,"""Consensus""",Designed for peer to peer debt transfer. Not b...,XRP-USD,Ripple,0.2704,0.0029,+1.08%,27.038B,7.729M,40.357M,526.05M,99.992B,,
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d,Java,PoS,Specifically designed as a flexible platform t...,NXT-USD,Nxt,0.0156,-0.0001,-0.62%,15.571M,79677,194528,294371,1B,,
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym),Scrypt,C++,PoW,Created as an alternative currency for Iceland...,,,,,,,,,,,,


In [15]:
# Sanity check 1.3.3 for joined_on_name_df

print("%d matches found"%len(joined_on_name_df[joined_on_name_df["Name"].notna()]))
if len(joined_on_name_df[joined_on_name_df["Name"].notna()]) == 0:
    raise AssertionError('Your join did not find any matches. Maybe you did something wrong?')

12 matches found


In [0]:
# Hidden tests 1.3.3 for autograding citation deletion - please don't delete


#### Task 1.3.4 A Better Column

Look again at `cryptocurrency_df` and `price_df` and select better columns for indexing the join. Consider an `apply` function for the relevant column in `cryptocurrency_df` and for the relevant column in `price_df` that you select.

Name this table `joined_df`. To get the points for this section, you need to match at least as many currencies as our solution.
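If the ticker symbols are the columns you choose, both sides need normalizing before they can match. This is a sketch under that assumption; `first_symbol` and `ticker_to_symbol` are hypothetical helpers, and the sample values come from the tables shown earlier.

```python
import re


def first_symbol(raw):
    """Return the first symbol from a Wikipedia cell such as 'BTC,[4] XBT, ₿'."""
    without_citations = re.sub(r"\[[^\]]*\]", "", str(raw))
    return without_citations.split(",")[0].strip()


def ticker_to_symbol(ticker):
    """Strip the '-USD' suffix from a Yahoo ticker such as 'BTC-USD'."""
    suffix = "-USD"
    return ticker[:-len(suffix)] if ticker.endswith(suffix) else ticker


print(first_symbol("BTC,[4] XBT, ₿"), ticker_to_symbol("BTC-USD"))
print(first_symbol("DOGE, XDG, Ð"), ticker_to_symbol("DOGE-USD"))
```

Taking only the first listed symbol is a design choice: a currency whose Yahoo ticker matches its second or third Wikipedia symbol would still be missed.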

In [16]:
# TODO: Improve the join by switching to different columns

# YOUR CODE HERE
price_df['Symbol'] = price_df['Symbol'].apply(lambda x:x.replace('-USD', ''))
#price_df['Symbol']=price_df['Symbol'].apply(lambda x:str(x).strip())
cryptocurrency_df['Symbol'] = cryptocurrency_df['Symbol'].apply(lambda x:re.sub(r'\[(.*?)\]', '', str(x)))
cryptocurrency_df['Symbol'] = cryptocurrency_df['Symbol'].apply(lambda x:x.split(",")[0])
joined_df=cryptocurrency_df.merge(price_df, left_on=['Symbol'], right_on=['Symbol'],how='left')
#display(joined_on_name_df)
display(joined_df)

Unnamed: 0,Release,Currency,Symbol,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,2009,Bitcoin,BTC,Satoshi Nakamoto,SHA-256d,C++,PoW,The first and most widely used decentralized l...,Bitcoin,9732.13,37.53,+0.39%,174.729B,56.445M,342.15M,2.907B,17.954M,,
1,2011,Litecoin,LTC,Charlie Lee,Scrypt,C++,PoW,One of the first cryptocurrencies to use Scryp...,Litecoin,66.82,0.35,+0.53%,4.212B,5.367M,37.968M,730.148M,63.039M,,
2,2011,Namecoin,NMC,Vincent Durham,SHA-256d,C++,PoW,"Also acts as an alternative, decentralized DNS.",,,,,,,,,,,
3,2012,Peercoin,PPC,Sunny King(pseudonym),SHA-256d,C++,PoW & PoS,The first cryptocurrency to use POW and POS fu...,,,,,,,,,,,
4,2013,Dogecoin,DOGE,Jackson Palmer& Billy Markus,Scrypt,C++,PoW,Based on the Doge internet meme.,Dogecoin,0.0024,0.0,0.00%,295.196M,445887,2.088M,8.234M,121.329B,,
5,2013,Gridcoin,GRC,Rob Hälford,Scrypt,C++,Decentralized PoS,Linked to citizen science through the Berkeley...,,,,,,,,,,,
6,2013,Primecoin,XPM,Sunny King(pseudonym),1CC/2CC/TWN,"TypeScript, C++",PoW,Uses the finding of prime chains composed of C...,,,,,,,,,,,
7,2013,Ripple,XRP,Chris Larsen &Jed McCaleb,ECDSA,C++,"""Consensus""",Designed for peer to peer debt transfer. Not b...,Ripple,0.2704,0.0029,+1.08%,27.038B,7.729M,40.357M,526.05M,99.992B,,
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d,Java,PoS,Specifically designed as a flexible platform t...,Nxt,0.0156,-0.0001,-0.62%,15.571M,79677,194528,294371,1B,,
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym),Scrypt,C++,PoW,Created as an alternative currency for Iceland...,,,,,,,,,,,


In [17]:
# Sanity check 1.3.4 for joined_df

print("%d matches found"%len(joined_df[joined_df["Name"].notna()]))
if len(joined_df[joined_df["Name"].notna()]) <= len(joined_on_name_df[joined_on_name_df["Name"].notna()]):
    raise AssertionError('Your new join is not better than the old one. Maybe you did something wrong?')

18 matches found


In [0]:
# Hidden tests 1.3.4 for autograding joined_df  - please don't delete


### Task 1.4: Save the cryptocurrency list in a database table

We don't want to continue to hit Wikipedia.org every time we want to consult the list of cryptocurrencies.  Save your `cryptocurrency_df` to sqlite, in a table called `cryptocurrency`.  

**The Dataframe `index` has no particular meaning, so don't save it!**
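The effect of leaving out the index can be checked on a throwaway in-memory database. The sketch below uses a made-up two-row frame and a separate connection named `demo_conn` so that it does not touch the `local.db` file used in this task.

```python
import sqlite3

import pandas as pd

# Made-up sample frame; the real task saves cryptocurrency_df instead.
sample_df = pd.DataFrame({
    "Currency": ["Bitcoin", "Litecoin"],
    "Symbol": ["BTC", "LTC"],
})

demo_conn = sqlite3.connect(":memory:")  # temporary database, discarded on close

# index=False keeps the meaningless row numbers out of the table.
sample_df.to_sql("cryptocurrency", demo_conn, if_exists="replace", index=False)

round_trip = pd.read_sql_query("select * from cryptocurrency", demo_conn)
demo_conn.close()

print(list(round_trip.columns))
```

With the default `index=True`, the table would gain an extra `index` column, which is exactly what the sanity check for this task looks for.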

In [0]:
# TODO: convert cryptocurrency_df to sqlite

conn = sqlite3.connect('local.db')
cryptocurrency_df.to_sql('cryptocurrency', conn, if_exists='replace', index=False)


# YOUR CODE HERE


In [19]:
# Sanity check 1.4 for sqlite databases

crypto2 = pd.read_sql_query('select * from cryptocurrency', conn)
crypto2
if 'index' in crypto2:
    raise AssertionError('Please disable the index, since it isn\'t important information')
    
display(crypto2)

Unnamed: 0,Release,Currency,Symbol,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes
0,2009,Bitcoin,BTC,Satoshi Nakamoto,SHA-256d,C++,PoW,The first and most widely used decentralized l...
1,2011,Litecoin,LTC,Charlie Lee,Scrypt,C++,PoW,One of the first cryptocurrencies to use Scryp...
2,2011,Namecoin,NMC,Vincent Durham,SHA-256d,C++,PoW,"Also acts as an alternative, decentralized DNS."
3,2012,Peercoin,PPC,Sunny King(pseudonym),SHA-256d,C++,PoW & PoS,The first cryptocurrency to use POW and POS fu...
4,2013,Dogecoin,DOGE,Jackson Palmer& Billy Markus,Scrypt,C++,PoW,Based on the Doge internet meme.
5,2013,Gridcoin,GRC,Rob Hälford,Scrypt,C++,Decentralized PoS,Linked to citizen science through the Berkeley...
6,2013,Primecoin,XPM,Sunny King(pseudonym),1CC/2CC/TWN,"TypeScript, C++",PoW,Uses the finding of prime chains composed of C...
7,2013,Ripple,XRP,Chris Larsen &Jed McCaleb,ECDSA,C++,"""Consensus""",Designed for peer to peer debt transfer. Not b...
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d,Java,PoS,Specifically designed as a flexible platform t...
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym),Scrypt,C++,PoW,Created as an alternative currency for Iceland...


### Task 1.5: Read the cryptocurrency pages

Now let's take each of the cryptocurrency names and find the associated URL. The names of the currencies were originally clickable links on the [webpage](https://en.wikipedia.org/wiki/List_of_cryptocurrencies) that we made the table from, but unfortunately, `pandas` automatically deleted the URLs. So we have to regenerate them. Feel free to look at that page to see what the correct URL is for each currency.

In the cell below, complete the function `crawl`. The function name, inputs, first line, and last line are provided for you. 

`list_of_urls` should contain the URLs of interest as a list, column of a pandas DataFrame, or some other iterable over strings. 

`prefix` contains a common string that should be added to the beginning of every URL in `list_of_urls` before each URL is queried. 

The line `pages = {}` creates an empty dictionary. After running your part of the function `crawl`, `pages` should have currency names as its keys and the corresponding Wikipedia page contents as its values. This is what the function returns.

You have two options for completing this cell:

1. If you want to use `urllib.request.urlopen`, you should then use `read()` and `decode('utf-8')`.

2. If you want to use `scrapy`, follow the process in [this notebook from class](https://www.google.com/url?q=https://drive.google.com/file/d/1VfnlGr_VofdcEqACM2jRu2BwYm0QyTSh/view?usp%3Dsharing&sa=D&ust=1567968915286000&usg=AFQjCNG5iEWgUoA3DrRLhV1TKiT2OXHD1A).

For now, use a `try` statement to catch the errors and print a message that the URL could not be crawled. That is, in this cell we will have a **single rule** and not do any manual cleaning.  If you were doing this at web scale, you would be reluctant to invest a lot of manual effort...
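As a rough illustration of the `urllib` option, the sketch below separates building the URL from fetching it. Both helpers are hypothetical and not part of the required `crawl` signature; escaping the page name with `urllib.parse.quote` is an optional extra, not something the assignment asks for.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_url(prefix, name):
    """Join prefix and page name, escaping characters that are unsafe in a URL path."""
    # Wikipedia page URLs use underscores where titles have spaces.
    return prefix + urllib.parse.quote(name.replace(" ", "_"))


def fetch_page(url):
    """Return the decoded page body, or None if the request fails."""
    try:
        # The with-block closes the response even if reading fails.
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, ValueError) as err:
        # HTTPError is a subclass of URLError, so HTTP 404 responses land here too.
        print("Could not crawl %s: %s" % (url, err))
        return None


print(build_url("https://en.wikipedia.org/wiki/", "Ethereum Classic"))
```

Catching specific exception types, rather than using a bare `except`, keeps unrelated bugs (such as a misspelled variable name) from being silently reported as crawl failures.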

In [20]:
# TODO: Crawl the pages.  
# Trap the errors and figure out what you need to fix (in the cleaning step below)

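def crawl(list_of_urls, prefix=""):
  pages = {}
  for ele in list_of_urls:
    try:
      # The with-block closes the response (the original `response.close` never called the method)
      with urllib.request.urlopen(prefix+ele) as response:
        pages[ele] = response.read().decode('utf-8')
    except Exception:
      # Single rule: every failure prints this same message, even if the error is not a 404
      print("HTTP Error 404: Not Found for"+' '+ele)
  return pages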
''' 
import scrapy
from scrapy.crawler import CrawlerProcess
pages = {}
def crawl(list_of_urls, prefix=""):
  crawl_list=[]
  for ele in list_of_urls:
      crawl_list.append(prefix+ele)
  
  class WebCrawler(scrapy.Spider):
      name = "cis545"

      def start_requests(self):
          for url in crawl_list:
              yield scrapy.Request(url=url, callback=self.parse)

      def parse(self, response):
          global pages
          page = response.url.split("/")[-1]
          pages[page]=response
          self.log('Looking at file %s' % page)

  process = CrawlerProcess({
      'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
  })

  process.crawl(WebCrawler)
  process.start() # the script will block here until the crawling is finished
  return pages
'''
  


The following cell passes the currencies in our table to the `crawl` function. You do not need to modify the cell.

In [21]:
# Sanity check 1.5.1 for initial crawl

pages = crawl(cryptocurrency_df['Currency'], 'https://en.wikipedia.org/wiki/')
for page in pages:
    print (page)
    
print ('Total crawl: %d cryptocurrencies'%len(pages))
type(pages['Bitcoin'])



HTTP Error 404: Not Found for Ether or "Ethereum"
Bitcoin
Litecoin
Namecoin
Peercoin
Dogecoin
Gridcoin
Primecoin
Ripple
Nxt
Auroracoin
Dash
NEO
MazaCoin
Monero
NEM
PotCoin
Titcoin
Verge
Stellar
Vertcoin
Ethereum Classic
Tether
Zcash
Bitcoin Cash
EOS.IO
Total crawl: 25 cryptocurrencies


str

In [0]:
# Hidden tests 1.5.1 for autograding pages  - please don't delete


Did you get any errors? Did you ever get the wrong URL (and therefore the content from the wrong page)? Fix those two problems in the function `crawl_better` below. This function has the same inputs and outputs as `crawl`, but this time, it is okay if your fixes are specific to these sites. For example, you can try attaching `_(disambiguation)`, pull up that page's `etree.HTML(content)` and look for a link that has the name of the currency plus `' (cryptocurrency)'`.
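The disambiguation hint above can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's `html.parser` in place of `etree.HTML(content)` so that it runs without extra packages, the names `LinkCollector` and `find_cryptocurrency_link` are hypothetical, and the HTML fragment is made up rather than taken from a real Wikipedia page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, title) pairs from every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr = dict(attrs)
            if "href" in attr:
                self.links.append((attr["href"], attr.get("title", "")))


def find_cryptocurrency_link(html, currency):
    """Return the href of the link titled '<currency> (cryptocurrency)', or None."""
    collector = LinkCollector()
    collector.feed(html)
    wanted = currency + " (cryptocurrency)"
    for href, title in collector.links:
        if title == wanted:
            return href
    return None


# Hypothetical fragment of a disambiguation page, made up for illustration.
sample = """
<ul>
  <li><a href="/wiki/Dash_(punctuation)" title="Dash (punctuation)">Dash</a></li>
  <li><a href="/wiki/Dash_(cryptocurrency)" title="Dash (cryptocurrency)">Dash</a></li>
</ul>
"""

print(find_cryptocurrency_link(sample, "Dash"))   # /wiki/Dash_(cryptocurrency)
print(find_cryptocurrency_link(sample, "Verge"))  # None
```

A lookup like this only helps when the disambiguation page actually uses the `(cryptocurrency)` suffix; pages such as Ripple or Stellar use other suffixes and would still need a site-specific rule.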

In [0]:
# TODO: Re-run the crawl, fixing the issues

# Crawl the pages.  You may use urllib.request.urlopen or scrapy
# Assemble the list of results in the list pages.
# Trap the errors and figure out what you need to fix (in the cleaning step below)
'''
import scrapy
from scrapy.crawler import CrawlerProcess
pages={}  
def crawl_better(list_of_urls, prefix=""):
  disambiguation = {'Ripple':'Ripple_(payment_protocol)','Dash':'Dash_(cryptocurrency)','Monero':'Monero_(cryptocurrency)','NEM':'NEM_(cryptocurrency)','Verge':'Verge_(cryptocurrency)','Stellar':'Stellar_(payment_network)','Tether':'Tether_(cryptocurrency)'}
  crawl_list=[]
  for ele in list_of_urls:
    if ele in disambiguation:
      crawl_list.append(prefix+disambiguation[ele])
    else:
      crawl_list.append(prefix+ele)

  class WebCrawler(scrapy.Spider):
      name = "cis545"

      def start_requests(self):
          for url in crawl_list:
              yield scrapy.Request(url=url, callback=self.parse)

      def parse(self, response):
          global pages
          page = response.url.split("/")[-1]
          pages[page]=response
          self.log('Looking at file %s' % page)

  process = CrawlerProcess({
      'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
  })

  process.crawl(WebCrawler)
  process.start() # the script will block here until the crawling is finished
  return pages
  '''    
import urllib.request

# Currency names whose Wikipedia article title differs from the table entry.
# These fixes are specific to this table, which the task allows.
TITLE_FIXES = {
  'Ether or "Ethereum"': 'Ethereum',
  'Ripple': 'Ripple_(payment_protocol)',
  'Dash': 'Dash_(cryptocurrency)',
  'Monero': 'Monero_(cryptocurrency)',
  'NEM': 'NEM_(cryptocurrency)',
  'Verge': 'Verge_(cryptocurrency)',
  'Stellar': 'Stellar_(payment_network)',
  'Tether': 'Tether_(cryptocurrency)',
}

def crawl_better(list_of_urls, prefix=""):
  pages = {}
  for ele in list_of_urls:
    title = TITLE_FIXES.get(ele, ele)
    # Store Ethereum under its article name; every other page keeps its table name
    key = 'Ethereum' if ele == 'Ether or "Ethereum"' else ele
    # The with-block closes the response once it has been read
    with urllib.request.urlopen(prefix + title) as response:
      pages[key] = response.read().decode('utf-8')
  return pages



As before, the cell below just runs your function and does not need to be modified.

In [23]:
# Sanity check 1.5.2 for better crawl

pages = crawl_better(cryptocurrency_df['Currency'], 'https://en.wikipedia.org/wiki/')
'''
def crawl_better(list_of_urls, prefix=""):
  lst_pro = ['Ripple']
  lst_crypto = ['Dash','Monero','NEM','Verge','Tether','Petro']
  lst_net = ['Stellar']
  list_of_urls = ['Ethereum' if x=='Ether or "Ethereum"' else x for x in list_of_urls] 
  pages = {}
  for url in list_of_urls:
    if url in lst_pro:
      suffix = '_(payment_protocol)'
    elif url in lst_crypto:
      suffix = '_(cryptocurrency)'
    elif url in lst_net:
      suffix = '_(payment_network)'
    else:
      suffix = ''
    try:
      with urllib.request.urlopen(prefix + url + suffix) as res:
        f = res.read().decode('utf-8')
      pages.update({url:f})
    except:
      print(url + ' :wrong URL')
  return pages
pages = crawl_better(cryptocurrency_df['Currency'], 'https://en.wikipedia.org/wiki/')
print(pages)
'''



In [0]:
# Hidden tests 1.5.2 for autograding pages  - please don't delete


### Task 1.6: Sanity-check and fix

Note that sometimes terms in Wikipedia are **ambiguous**, so just following the page doesn't always get what you want.  The Wikipedia page for [Tether](https://en.wikipedia.org/wiki/Tether) does not describe a cryptocurrency.

We can add a data-cleaning rule to check this: every cryptocurrency should mention the term "blockchain".  Here's a sanity check you can use.  If there are any disambiguation pages, you need to go back to Task 1.5 and update your process to crawl the right page.

You do not need to modify this cell.

In [24]:
count_wrong = 0

for page,content in pages.items():
    if isinstance(content, bytes):
        raise AssertionError('Please run decode(\'utf-8\') on the content to decode to a string')
        content = content.decode('utf-8')
        print(content)
        
    if 'blockchain' not in content:
        print(page + ': ' + ' -- did not find blockchain!')
        count_wrong = count_wrong + 1

        
print ('Total crawl: %d cryptocurrencies'%len(pages))

if count_wrong > 0:
    raise AssertionError('Need to follow Wikipedia disambiguation pages on %d items!'%count_wrong)

Total crawl: 26 cryptocurrencies


### Task 1.7: Clean the articles

So far, we have captured HTML content for each Wikipedia article, but HTML is not very easy to read and process. So the next step is to clean up the text in each article. To do that, you need to complete the function definition below. The function name and input are provided for you.

The first step is to get a list of paragraphs of content. See our [slides](https://www.google.com/url?q=https://drive.google.com/a/seas.upenn.edu/file/d/163sCi0h5RJAXynE1Vo37bAQtOvcwW_wv/view?usp%3Dsharing&sa=D&ust=1567968915286000&usg=AFQjCNGDBY3SNFEJIh3m5k7GyYmhK2Q52w) on xpath for hints. Then, for each word (string between whitespace characters):

1. Remove the leading and trailing whitespace using `strip()`
2. Remove the word entirely if it is only white space.
3. Remove the word entirely if it is only numerics (you may use `isnumeric()` to test for this).

Finally, join the words into one string separated by spaces using `' '.join()`. The function should return that string.
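The word-level rules above can be sketched on their own, without any HTML parsing. The sketch assumes the paragraph texts have already been extracted into a list of strings; the name `clean_paragraphs` and the example input are hypothetical and not part of the assignment.

```python
def clean_paragraphs(paragraphs):
    """Join the words of all paragraphs, dropping blanks and pure numbers."""
    kept = []
    for paragraph in paragraphs:
        # split() with no argument splits on any run of whitespace,
        # so whitespace-only "words" never appear in the result.
        for word in paragraph.split():
            word = word.strip()
            if not word or word.isnumeric():
                continue
            kept.append(word)
    return ' '.join(kept)


# Made-up paragraph texts, for illustration only.
example = ["  Bitcoin  was released in 2009 ", "\n", "21 million coins"]
print(clean_paragraphs(example))  # Bitcoin was released in million coins
```

Note that `isnumeric()` only removes tokens made up entirely of numeric characters, so a token such as `2009,` (with trailing punctuation) would be kept.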

In [0]:
# TODO: Complete the clean_article function, as described above.

'''

def clean_article(content):
  elements = content.xpath('//table[contains(@class,"vcard")]')
  if elements:
    content = elements.xpath('tbody/tr/td//span[@class="content"]/text()').get()
    print(content)
    
    if content:
      content.str.strip().str.replace(' ','')
      if content.isnumeric():
        content=''
     
  return content
'''
def clean_article(content):
  # Text nodes of every paragraph in the page
  article=etree.HTML(content).xpath('//p//text()')
  for i in range(len(article)):
    article[i]=article[i].strip()
    # Blank out fragments that are purely numeric
    if article[i].isnumeric():
      article[i]=''
  article=" ".join(article).strip()
  
  return article




# YOUR CODE HERE

The following cell assembles our cleaned articles into a DataFrame. You do not need to modify the cell.

In [31]:
pages2 = []
for currency_name, content in pages.items():
    article = clean_article(content)
    pages2.append({'currency': currency_name, 'text': article})

pages_df = pd.DataFrame(pages2)

display(pages_df)


| | currency | text |
|---|---|---|
| 0 | Bitcoin | Bitcoin [a] ( ₿ ) is a cryptocurrency . It is ... |
| 1 | Litecoin | Litecoin ( LTC or Ł ) is a peer-to-peer crypt... |
| 2 | Namecoin | Namecoin ( Symbol : ℕ or NMC ) is a cryptocurr... |
| 3 | Peercoin | Peercoin , also known as PPCoin or PPC , is a ... |
| 4 | Dogecoin | Dogecoin ( / ˈ d oʊ dʒ k ɔɪ n / DOHJ -koyn , ... |
| 5 | Gridcoin | Gridcoin implements a "Proof-of-Research" (POR... |
| 6 | Primecoin | Primecoin ( sign : Ψ ; code: XPM ) is a crypto... |
| 7 | Ripple | Ripple is a real-time gross settlement system,... |
| 8 | Nxt | Nxt is an open source cryptocurrency and paym... |
| 9 | Auroracoin | Auroracoin (code: AUR, symbol: ᚠ ) is a peer-t... |




In [0]:
# Hidden tests 1.7 for autograding clean_article  - please don't delete


# Task 2: Build and run the classifier

Now that we have the cryptocurrency articles processed, it is time to return to the original task of building a classifier that can identify cryptocurrency articles.

## Task 2.1: Get the negative examples.

If we want to build a (supervised) machine learning algorithm to detect content, we need both *positive* and *negative* examples.  In fact we want each successive training example to have an equal probability of being positive or negative.

The following cell runs your `crawl` function from Task 1.5 and your `clean_article` function from Task 1.7. Note: We are using `crawl` not `crawl_better` because you may have included data-specific choices in `crawl_better` that are no longer true.

You do not need to modify this cell.

In [32]:
training = [
    'https://en.wikipedia.org/wiki/Tim_Cook',
    'https://en.wikipedia.org/wiki/The_Great_British_Bake_Off',
    'https://en.wikipedia.org/wiki/Google',
    'https://en.wikipedia.org/wiki/Chan_Zuckerberg_Initiative',
    'https://en.wikipedia.org/wiki/Politics',
    'https://en.wikipedia.org/wiki/Fake_news',
    'https://www.snopes.com/fact-check/social-media-hacker-warning/',
    'https://www.cnn.com/2019/08/31/us/dorian-animals-foster-release-wxc/index.html',
    'https://www.foxnews.com/us/indiana-dispatcher-helps-boy-who-called-911-with-fractions-homework',
    'https://www.usatoday.com/story/tech/talkingtech/2019/08/31/hello-iphone-11-new-features-we-want-apple-next-models/2153565001/',
    'http://theconversation.com/bury-fc-the-economics-of-an-english-football-clubs-collapse-122727',
    'https://fivethirtyeight.com/features/economists-are-bad-at-predicting-recessions/'
]

negative = crawl(training)
negative2 = []
for site, content in negative.items():
    article = clean_article(content)
    negative2.append({'site': site, 'text': article})

negative_df = pd.DataFrame(negative2)
display(negative_df)

| | site | text |
|---|---|---|
| 0 | https://en.wikipedia.org/wiki/Tim_Cook | Timothy Donald Cook (born November 1, 1960) [3... |
| 1 | https://en.wikipedia.org/wiki/The_Great_Britis... | The Great British Bake Off (often abbreviated ... |
| 2 | https://en.wikipedia.org/wiki/Google | Google LLC [5] is an American multinational te... |
| 3 | https://en.wikipedia.org/wiki/Chan_Zuckerberg_... | The Chan Zuckerberg Initiative ( CZI ) is a li... |
| 4 | https://en.wikipedia.org/wiki/Politics | Politics is a set of activities associated wit... |
| 5 | https://en.wikipedia.org/wiki/Fake_news | Fake news (also known as junk news , pseudo-ne... |
| 6 | https://www.snopes.com/fact-check/social-media... | Snopes needs your help! Learn more . Accepting... |
| 7 | https://www.cnn.com/2019/08/31/us/dorian-anima... | By Madeline Holcombe , CNN Updated 3:20 AM ET,... |
| 8 | https://www.foxnews.com/us/indiana-dispatcher-... | This material may not be published, broadcast,... |
| 9 | https://www.usatoday.com/story/tech/talkingtec... | Settings Cancel Set Have an existing account? ... |


## Task 2.2: Process Document Text

Right now, each Wikipedia article is a single string. This means we have only one "feature" for the classifier, which is not enough. Tokenization (splitting the article into words) transforms the data so that we have one feature per word, which should give us enough features to train a classifier.

Complete the `get_words` function in the cell below. This function should take a string as input (the raw article).

1. Create an empty list to store the good words.

1. Break the article into sentences using the NLTK sentence tokenizer.

1. Tokenize and part-of-speech tag each sentence.

1. Run the provided `clean_word` function and Porter stemmer on each word.

1. Finally, append the word stem to the list of good words if all of the following are true:
    1. The word stem is of nonzero length.
    2. The word stem has a length less than 20.
    3. The word stem is not a stopword.
    4. The word is a noun.
    5. The word stem is in `vocabulary`. Only apply this rule if `vocabulary` has nonzero length. It has zero length by default.

6. Return the list of good words.

To match our solution, it is important that you do these steps in the given order.
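The five acceptance rules for a word stem can be read as a single predicate. The sketch below is one possible way to express them, not the reference solution: the name `keep_stem` and its parameters are hypothetical, and it assumes the part-of-speech tag follows the Penn Treebank convention, in which all noun tags (`NN`, `NNS`, `NNP`, `NNPS`) start with `N`.

```python
def keep_stem(stem, tag, stopwords, vocabulary=()):
    """Return True if a stem passes all five rules, checked in the listed order."""
    if len(stem) == 0:            # rule 1: nonzero length
        return False
    if len(stem) >= 20:           # rule 2: length less than 20
        return False
    if stem in stopwords:         # rule 3: not a stopword
        return False
    if not tag.startswith('N'):   # rule 4: the word is a noun
        return False
    # rule 5: only applied when a non-empty vocabulary is given
    if len(vocabulary) > 0 and stem not in vocabulary:
        return False
    return True


# Made-up stopword set, for illustration only.
stop = {'the', 'be'}
print(keep_stem('function', 'NN', stop))              # True
print(keep_stem('x' * 30, 'NN', stop))                # False (too long)
print(keep_stem('want', 'VBZ', stop))                 # False (not a noun)
print(keep_stem('articl', 'NN', stop, ['bitcoin']))   # False (not in vocabulary)
```

Keeping the rules in one function makes it easy to test each rule in isolation before wiring it into the tokenizing and stemming loop.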

In [0]:
# TODO: Complete the get_words function

sw = set(stopwords.words("english"))
sw.add("'s")
stemmer = PorterStemmer()

#from nltk.tokenize import TweetTokenizer
from nltk.tokenize import sent_tokenize, word_tokenize



def clean_word(word):
    word = word.lower()
    word2 = ''
    for w in word:
        if w.isalpha() or (len(word2) > 0 and w.isnumeric()):
            word2 = word2 + w
    return word2

def get_words(article, vocabulary=[]):
    clean_words = []
    for sent in sent_tokenize(article):
      # Tag the raw tokens first; cleaning and stemming happen afterwards
      for token, tag in nltk.pos_tag(word_tokenize(sent)):
        stem = stemmer.stem(clean_word(token))
        if (0<len(stem)<20) and (stem not in sw):
          # Penn Treebank noun tags all start with 'N' (NN, NNS, NNP, NNPS)
          if tag[0]=='N':
            # An empty vocabulary means no vocabulary filter
            if not vocabulary or stem in vocabulary:
              clean_words.append(stem)
    return clean_words

In [85]:
# Sanity check 2.2 for getting the word stems from articles

print(get_words("to be or not to be"))
print(get_words("He wants to test the functionality of xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx in article 091019."))



[]
['function', 'articl']


In [0]:
# Hidden tests 2.2 for autograding get_words  - please don't delete


## Task 2.3 Train the classifier

Adapt the code from the NLTK lecture notebook to complete the `build_classifier` function. This function takes as input the two-column DataFrames `positive_df` and `negative_df`, plus an optional vocabulary list. It should run `get_words` on each article in each DataFrame, build an NLTK frequency distribution for each article, assemble the training set for a Naive Bayes classifier in the correct format, train the classifier, and return it.
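NLTK's `NaiveBayesClassifier.train` expects a list of `(featureset, label)` pairs, where each featureset is a dictionary. The sketch below shows only that shape. As a stand-in so that it runs without NLTK, it uses `collections.Counter` where the assignment uses `nltk.FreqDist` (which is itself a `Counter` subclass); the function name `build_training_set`, the label strings, and the token lists are assumptions made for illustration.

```python
from collections import Counter


def build_training_set(positive_tokens, negative_tokens):
    """Pair each article's word counts with its class label."""
    training_set = []
    for tokens in positive_tokens:
        training_set.append((Counter(tokens), 'positive'))
    for tokens in negative_tokens:
        training_set.append((Counter(tokens), 'negative'))
    return training_set


# Made-up token lists standing in for the output of get_words.
positive_tokens = [['bitcoin', 'block', 'bitcoin'], ['coin', 'wallet']]
negative_tokens = [['bake', 'show']]

training_set = build_training_set(positive_tokens, negative_tokens)
for features, label in training_set:
    print(label, dict(features))
```

In the notebook itself, the list built this way would be passed to `NaiveBayesClassifier.train`.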

In [0]:
# TODO: Complete the build_classifier function
def build_classifier(positive_df, negative_df, vocabulary=[]):
  pos_set = []
  for word in positive_df['text']:
      pos_set.append((nltk.FreqDist(get_words(word,vocabulary)), 'positive'))
  neg_set = []
  for word in negative_df['text']:
      neg_set.append((nltk.FreqDist(get_words(word,vocabulary)), 'negative'))
  classifier = NaiveBayesClassifier.train(pos_set + neg_set)
  return classifier
  



In [87]:
# Sanity check 2.3 for training the classifier
classifier = build_classifier(pages_df, negative_df)
print(type(classifier))

# This should print <class 'nltk.classify.naivebayes.NaiveBayesClassifier'>

<class 'nltk.classify.naivebayes.NaiveBayesClassifier'>


## Task 2.4: Run the classifier

Below are some sample pages.  Let's see if you can run the model on them.

### Task 2.4.1 Load the test set

Adapt the code from Task 2.1 for the new dataset. Call the final dataframe `inference_df`.

In [0]:
# TODO: Create inference_df

test = [
    'https://fried.com/history-of-bitcoin/',
    'https://news.wharton.upenn.edu/press-releases/2018/06/penn-launches-strategic-collaboration-ripple-accelerate-innovation-blockchain-cryptocurrency/',
    'https://en.wikipedia.org/wiki/Euro',
    'https://ew.com/movies/star-wars-rise-of-skywalker-footage-d23-expo/',
    'https://en.wikipedia.org/wiki/Donald_Trump'
]

# YOUR CODE HERE

# Crawl the test pages and clean them exactly as in Task 2.1
test_pages = crawl(test)
test_rows = []
for site, content in test_pages.items():
    article = clean_article(content)
    test_rows.append({'site': site, 'text': article})

inference_df = pd.DataFrame(test_rows)

In [89]:
# Sanity check 2.4.1 loading the test set

display(inference_df)

| | site | text |
|---|---|---|
| 0 | https://fried.com/history-of-bitcoin/ | Follow us! Last updated: August 5th, 2019 Bitc... |
| 1 | https://news.wharton.upenn.edu/press-releases/... | PHILADELPHIA, PA, June 4, 2018 — The Wharton S... |
| 2 | https://en.wikipedia.org/wiki/Euro | The euro ( sign : € ; code : EUR ) is the offi... |
| 3 | https://ew.com/movies/star-wars-rise-of-skywal... | With the new Star Wars: The Rise of Skywalker ... |
| 4 | https://en.wikipedia.org/wiki/Donald_Trump | Donald John Trump (born June 14, 1946) is the ... |


### Task 2.4.2: Inference

Now let's run your classifier over the individual documents. Adapt the code from the NLTK lecture notebook. The function `classify` should take as input a two-column DataFrame like the ones we made previously, the trained classifier, and an optional vocabulary list. It should return a list of booleans, one per article. For example, a perfect classifier should return

`classify(inference_df, classifier) = [True, True, False, False, False]`.

Note that you will need to run `get_words` (passing the vocabulary) and then generate an NLTK frequency distribution for each test article.

In [90]:
# TODO: Complete the classify function

def classify(df, classifier, vocabulary=[]):
  
  res=[]
  for words in df['text']:
    test_set=nltk.FreqDist(get_words(words,vocabulary))
    prob_result = classifier.prob_classify(test_set)
    print("p(negative) =", '{:.2f}'.format(prob_result.prob("negative")))
    print("p(positive) =", '{:.2f}'.format(prob_result.prob("positive")))
    if prob_result.max()=='positive':
      res.append(True)
    else:
      res.append(False)
  return res

#results=classify("I usually hate mornings but today was not bad",classifier) 
results=classify(inference_df,classifier)
display(results)
# YOUR CODE HERE




p(negative) = 1.00
p(positive) = 0.00
p(negative) = 1.00
p(positive) = 0.00
p(negative) = 1.00
p(positive) = 0.00
p(negative) = 1.00
p(positive) = 0.00
p(negative) = 1.00
p(positive) = 0.00


[False, False, False, False, False]

In [0]:
# Sanity check 2.4.2 classifier results

if len(results) != 5:
    raise AssertionError('We do not have a classification for each item.')

In [0]:
# Hidden tests 2.4.2 for autograding results  - please don't delete


## Task 2.5: Make the vocabulary and re-classify

So far, our classifier is not very good. This is because it is trying to consider too many words, many of which did or did not occur in the training articles purely by chance. If we restrict the "attention" of the classifier to the most frequent words, it is much more likely to pick up real patterns rather than memorize accidents. We do this by making a vocabulary.
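One way to restrict attention to the most frequent words is to count stems per class and keep the top few from each. The sketch below is an assumption about how such a vocabulary could be built, using `collections.Counter` rather than NLTK so that it runs on its own; the names `top_stems` and `sketch_vocabulary` and the token lists are hypothetical.

```python
from collections import Counter


def top_stems(token_lists, num):
    """Return the num most frequent stems across all token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    # most_common returns fewer than num items if fewer distinct stems exist
    return [stem for stem, _ in counts.most_common(num)]


def sketch_vocabulary(positive_token_lists, negative_token_lists, num):
    """Concatenate the top stems of the positive and negative articles."""
    return (top_stems(positive_token_lists, num)
            + top_stems(negative_token_lists, num))


# Made-up token lists standing in for the output of get_words.
positive = [['bitcoin', 'block', 'bitcoin'], ['bitcoin', 'block', 'coin']]
negative = [['news', 'news', 'show'], ['news', 'show', 'list']]

print(sketch_vocabulary(positive, negative, 2))
# ['bitcoin', 'block', 'news', 'show']
```

Because the two lists are simply concatenated, a stem that is frequent in both classes appears twice in the result.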



In [0]:
# TODO: Complete the make_vocabulary function

def make_vocabulary(positive_df, negative_df, num):
  pos = []
  neg = []
  for article in positive_df['text']: pos += get_words(article)
  for article in negative_df['text']: neg += get_words(article)
  pdist = nltk.FreqDist(pos)
  ndist = nltk.FreqDist(neg)
  # Keep the num most frequent stems of each class; most_common simply
  # returns fewer items when a class has fewer than num distinct stems.
  top_pos = [stem for stem, count in pdist.most_common(num)]
  top_neg = [stem for stem, count in ndist.most_common(num)]
  return top_pos + top_neg
    



In [92]:
# Sanity check 2.5.1 see final vocabulary size

vocabulary = make_vocabulary(pages_df, negative_df, 30)
print(vocabulary)
print(len(vocabulary))

['bitcoin', 'transact', 'cryptocurr', 'ethereum', 'block', 'blockchain', 'network', 'tether', 'price', 'coin', 'currenc', 'exchang', 'system', 'cash', 'time', 'user', 'develop', 'payment', 'softwar', 'account', 'bank', 'mine', 'wallet', 'token', 'dogecoin', 'us', 'market', 'januari', 'trade', 'protocol', 'news', 'googl', 'media', 'seri', 'compani', 'stori', 'govern', 'facebook', 'peopl', 'time', 'inform', 'state', 'websit', 'cook', 'elect', 'internet', 'show', 'trump', 'bbc', 'account', 'year', 'polit', 'fake', 'appl', 'presid', 'bake', 'search', 'world', 'list', 'campaign']
60


In [93]:
# Sanity check 2.5 improved classifier results

classifier_with_vocab = build_classifier(pages_df, negative_df, vocabulary)
results = classify(inference_df, classifier_with_vocab, vocabulary)
display(results)

p(negative) = 0.09
p(positive) = 0.91
p(negative) = 0.01
p(positive) = 0.99
p(negative) = 1.00
p(positive) = 0.00
p(negative) = 0.72
p(positive) = 0.28
p(negative) = 1.00
p(positive) = 0.00


[True, True, False, False, False]

In [0]:
# Hidden tests 2.5 for autograding results  - please don't delete


# Task 3: Submitting Your Homework

1. When you are done, select “Edit” at the top of the window, **under the filename, not the one that may appear above it**. Then, select “Clear all outputs”. Please do this just before turning in your homework because it reduces the size of your file.


2. In the same menu **under the filename**, select “File” and then “Download .ipynb”. It is very important that you do not change the file name of this downloaded notebook. Make sure that something like “(1)” did not get added to the filename and also that you did not download the .py version. Our autograder can only handle .ipynb files with the correct file name.

3. Compress the ipynb file into a Zip file **hw1.zip**.

4. Go to the [submission site](http://submit.dataanalytics.education), and click on the Google icon.  Log in using your Google@SEAS (if at all possible!) or (if you aren’t an Engineering student) GMail account.  

5. Click on the **Courses** icon at the top, then select **CIS 545** and **Save**. Select **cis545-2019c-hw1** and upload **hw1.zip**.

6. You should see a message on the submission site notifying you about whether your submission passed validation.  You may resubmit as necessary, but may have to withdraw your previous submission in OpenSubmit in order to do so.

**If you have not already, please go to Settings and set your Student ID to your PennID (all numbers)**.