# Data Scraping Demonstration

This notebook demonstrates how to scrape data from the website https://cryptocoincharts.info/.

In [1]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

First, let's download the page's HTML and print the first 300 characters.

In [2]:
webdata = requests.get('https://cryptocoincharts.info/')

In [3]:
print(webdata.text[0:300])

<!DOCTYPE html>
<html lang="en">
	<head>
		<title>Bitcoin and Altcoin price charts / graphs</title>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />		
		<link rel="apple-touch-icon" sizes="180x180" href="/img/favicons/apple-touch-icon.png">
		<link rel="icon" type="image/png" 


## Parsing HTML

In [4]:
soup = BeautifulSoup(webdata.text, 'html.parser')

## Example: Extract All the Names

Before extracting all the data, let's extract the cryptocurrency names to demonstrate how my method works.

After looking at the page source, I noticed that each name has the following format:

`<a  href="/coins/show/btc" class="link">Bitcoin</a>`

In [5]:
# Using this pattern, find every <a> tag that has the class 'link'
raw_names = soup.find_all('a', attrs={'class':'link'})
names = raw_names[2:102] # skip the first two matches; the page lists 100 names

all_names = []
for name in names:
    all_names.append(name.text)
# let's check that we extracted the names
all_names[:10]

['Bitcoin',
 'Ether',
 'Ripple',
 'Bitcoin Cash',
 'EOS',
 'Stellar Lumens',
 'Litecoin',
 'Cardano',
 'Tether',
 'Monero']

In [6]:
# Check that there are 100 names
len(all_names)

100

This confirms that there are 100 names and that we extracted all of them.
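
The slice `raw_names[2:102]` depends on where the coin links happen to sit among all the `link` anchors. As a hedged alternative, the sketch below filters on the `href` prefix `/coins/show/` seen in the example tag above, using a CSS selector. The HTML snippet (including its non-coin link) is made up for illustration, and I'm assuming that every coin link on the real page shares that prefix.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: one non-coin link plus two coin links
# that follow the <a href="/coins/show/..." class="link"> pattern.
sample_html = """
<a href="/markets/info" class="link">Markets</a>
<a href="/coins/show/btc" class="link">Bitcoin</a>
<a href="/coins/show/eth" class="link">Ether</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only anchors with class "link" whose href starts with /coins/show/
coin_links = sample_soup.select('a.link[href^="/coins/show/"]')
coin_names = [link.get_text(strip=True) for link in coin_links]
print(coin_names)
```

With a filter like this, the count check no longer depends on how many unrelated links come before the first coin.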

## Extract All Variables

After looking at the source code, I found that each cryptocurrency follows this pattern:

<img src = 'pattern.png'>

Each coin occupies 12 consecutive `<td>` cells, so fixed offsets within each group of 12 let us extract every variable.


In [7]:
# Every table cell on the page; each coin occupies 12 consecutive cells
mysoup = soup.find_all('td')
prices_raw = []
volumnes_24h_raw = []
marketcups_raw = []
supplies_raw = []
market_shares_raw = []
for i in range(100):
    # fixed offsets within the i-th coin's group of 12 cells
    price = mysoup[3 + 12 * i].text[12:-11]  # trim the characters around the price
    volumne_24h = mysoup[6 + 12 * i].text
    marketcup = mysoup[7 + 12 * i].text
    supply = mysoup[8 + 12 * i].text
    market_share = mysoup[9 + 12 * i].text[:-1]  # drop the trailing '%'
    prices_raw.append(price)
    volumnes_24h_raw.append(volumne_24h)
    marketcups_raw.append(marketcup)
    supplies_raw.append(supply)
    market_shares_raw.append(market_share)
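
The index arithmetic in the loop treats the flat list of `<td>` cells as rows of 12 cells each. Here is a minimal sketch of the same idea on plain strings. The cell contents and the names `CELLS_PER_ROW`, `OFFSETS` and `pick_columns` are hypothetical; the offsets 3, 6, 7, 8 and 9 are copied from the loop above.

```python
CELLS_PER_ROW = 12  # the loop above steps through the cells 12 at a time

# Offsets within one row, copied from the loop above
OFFSETS = {"price": 3, "volume_24h": 6, "marketcap": 7, "supply": 8, "market_share": 9}


def pick_columns(cells, n_rows):
    """Return one dict per row, reading fixed offsets from a flat cell list."""
    rows = []
    for i in range(n_rows):
        start = CELLS_PER_ROW * i
        rows.append({name: cells[start + offset] for name, offset in OFFSETS.items()})
    return rows


# Placeholder cells labelled "r<row>c<column>" so the picked positions are visible
fake_cells = [f"r{r}c{c}" for r in range(2) for c in range(CELLS_PER_ROW)]
picked = pick_columns(fake_cells, n_rows=2)
print(picked[1])
```

Keeping the offsets in one place makes it easier to adjust them if the page layout ever changes.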

In [8]:
data_raw_keys = {'Names': all_names,
                 'Price': prices_raw,
                 '24h_Volumn': volumnes_24h_raw,
                 'Marketcap': marketcups_raw,
                 'Supply': supplies_raw,
                 'Market_Share': market_shares_raw}

In [9]:
data_page_1_raw = pd.DataFrame(data = data_raw_keys)
cols = ['Names', 'Price','24h_Volumn','Marketcap','Supply','Market_Share']
data_page_1_raw = data_page_1_raw[cols]

In [10]:
data_page_1_raw.head(10)

| | Names | Price | 24h_Volumn | Marketcap | Supply | Market_Share |
|---|---|---|---|---|---|---|
| 0 | Bitcoin | $7,364 | $803,468,904 | $127,027,617,038 | 17,250,187 BTC | 51.0 |
| 1 | Ether | $302 | $418,767,681 | $28,977,184,852 | 101,765,204 ETH | 12.0 |
| 2 | Ripple | $0.3341 | $124,228,530 | $13,056,340,384 | 39,650,153,121 XRP | 5.21 |
| 3 | Bitcoin Cash | $628 | $225,743,243 | $10,933,654,446 | 17,331,475 BCH | 4.37 |
| 4 | EOS | $6.45 | $371,558,107 | $5,892,052,569 | 906,245,118 EOS | 2.35 |
| 5 | Stellar Lumens | $0.2207 | $14,307,514 | $4,298,185,852 | 18,773,722,237 XLM | 1.72 |
| 6 | Litecoin | $65 | $70,277,231 | $3,979,669,801 | 58,154,754 LTC | 1.59 |
| 7 | Cardano | $0.1041 | $15,854,442 | $2,733,911,021 | 25,927,070,538 ADA | 1.09 |
| 8 | Tether | $0.9960 | $5,003,645 | $2,647,962,416 | 2,767,140,336 USDT | 1.06 |
| 9 | Monero | $120 | $30,619,103 | $2,278,009,280 | 16,379,415 XMR | 0.91 |


We have now extracted all the data from the page!

Let's clean the data so we can export it to a CSV file!

## Data Cleaning

In [11]:
def take_dollar_sign(data):
    """Take a list or an array of strings, strip the dollar signs and
    thousands separators, and return the values as a list of floats."""
    cleaned = []
    for item in data:
        no_sign_and_comma = item.replace('$', '').replace(',','')
        cleaned.append(float(no_sign_and_comma))
    return cleaned

In [12]:
def take_string(data):
    """Take a list or an array of strings, keep only the digits of each
    item, and return the values as a list of floats."""
    no_str = []
    for item in data:
        # keeping digits only also drops the commas and the ticker symbol
        no_strings = ''.join([i for i in item if i.isdigit()])
        no_str.append(float(no_strings))
    return no_str
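
Filtering on `isdigit()` works for supplies such as `17,250,187 BTC`, but it would also drop a decimal point and keep any digit that appears in a ticker. Below is a hedged sketch of a regex-based alternative. `parse_supply` is a hypothetical helper, and the second and third inputs are invented to show those two edge cases.

```python
import re

# Leading number: digits with optional thousands commas and optional decimals
NUMBER_AT_START = re.compile(r"\s*(\d[\d,]*(?:\.\d+)?)")


def parse_supply(text):
    """Parse the leading number of a string like '17,250,187 BTC' into a float."""
    match = NUMBER_AT_START.match(text)
    if match is None:
        raise ValueError(f"no leading number in {text!r}")
    return float(match.group(1).replace(",", ""))


print(parse_supply("17,250,187 BTC"))  # value taken from the raw table above
print(parse_supply("1,234.5 XYZ"))     # invented: the decimal point is preserved
print(parse_supply("500 B2C"))         # invented: the digit in the ticker is ignored
```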

In [13]:
prices_clean = take_dollar_sign(prices_raw)
volumnes_24h_clean = take_dollar_sign(volumnes_24h_raw)
marketcups_clean = take_dollar_sign(marketcups_raw)
supplies_clean = take_string(supplies_raw)

In [14]:
data_clean_keys = {'Names': all_names,
                   'Price($)': prices_clean,
                   '24h_Volumn($)': volumnes_24h_clean,
                   'Marketcap($)': marketcups_clean,
                   'Supply': supplies_clean,
                   'Market_Share(%)': market_shares_raw}

In [15]:
data_page_1_clean = pd.DataFrame(data = data_clean_keys)
cols = ['Names', 'Price($)','24h_Volumn($)','Marketcap($)','Supply','Market_Share(%)']
data_page_1_clean = data_page_1_clean[cols]

In [16]:
data_page_1_clean.head(10)

| | Names | Price($) | 24h_Volumn($) | Marketcap($) | Supply | Market_Share(%) |
|---|---|---|---|---|---|---|
| 0 | Bitcoin | 7364.0 | 803468904.0 | 127027617038.0 | 17250190.0 | 51.0 |
| 1 | Ether | 302.0 | 418767681.0 | 28977184852.0 | 101765200.0 | 12.0 |
| 2 | Ripple | 0.3341 | 124228530.0 | 13056340384.0 | 39650150000.0 | 5.21 |
| 3 | Bitcoin Cash | 628.0 | 225743243.0 | 10933654446.0 | 17331480.0 | 4.37 |
| 4 | EOS | 6.45 | 371558107.0 | 5892052569.0 | 906245100.0 | 2.35 |
| 5 | Stellar Lumens | 0.2207 | 14307514.0 | 4298185852.0 | 18773720000.0 | 1.72 |
| 6 | Litecoin | 65.0 | 70277231.0 | 3979669801.0 | 58154750.0 | 1.59 |
| 7 | Cardano | 0.1041 | 15854442.0 | 2733911021.0 | 25927070000.0 | 1.09 |
| 8 | Tether | 0.996 | 5003645.0 | 2647962416.0 | 2767140000.0 | 1.06 |
| 9 | Monero | 120.0 | 30619103.0 | 2278009280.0 | 16379420.0 | 0.91 |
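
A quick sanity check on cleaned values: a market cap should be roughly price times supply. The sketch below uses the first three rows of the raw table shown earlier. `marketcap_is_plausible` and the 25% tolerance are my own assumptions; the products do not match the listed market caps exactly, so the tolerance is deliberately loose.

```python
def marketcap_is_plausible(price, supply, marketcap, rel_tol=0.25):
    """Return True if price * supply is within rel_tol of the market cap."""
    return abs(price * supply - marketcap) <= rel_tol * marketcap


# (name, price, supply, market cap) taken from the first rows of the raw table
rows = [
    ("Bitcoin", 7364.0, 17250187.0, 127027617038.0),
    ("Ether", 302.0, 101765204.0, 28977184852.0),
    ("Ripple", 0.3341, 39650153121.0, 13056340384.0),
]

for name, price, supply, marketcap in rows:
    print(name, marketcap_is_plausible(price, supply, marketcap))

# A value on the wrong scale, such as a market share of 51.0, fails the check
print(marketcap_is_plausible(7364.0, 17250187.0, 51.0))
```

A check like this is cheap to run on every row before exporting and catches columns that were filled from the wrong source list.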


Our data is now clean! Let's export it!

## Export Data

In [17]:
data_page_1_clean.to_csv('data_page_1_clean.csv', index = False, encoding = 'utf-8')
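
To double-check an export like this, the CSV can be read back and compared with what was written. Here is a minimal sketch that uses an in-memory buffer instead of a file; the two-row frame is a small stand-in built from values shown above, not the full data set.

```python
import io

import pandas as pd

# Small stand-in frame with values from the cleaned table above
sample = pd.DataFrame({
    "Names": ["Bitcoin", "Ether"],
    "Price($)": [7364.0, 302.0],
    "24h_Volumn($)": [803468904.0, 418767681.0],
})

# Write to an in-memory text buffer, the same way to_csv writes to a file
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind and read the CSV back
buffer.seek(0)
restored = pd.read_csv(buffer)

# index=False means no extra index column appears after the round trip
print(list(restored.columns))
print(restored["Price($)"].tolist())
```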

Done and done!