![ChinesePod](https://chinesepod.com/instapage/academic_offers/32689996-0-logo-black.png)
# ChinesePod Lesson Downloader

## Introduction 
If you have a ChinesePod subscription, this notebook can automate downloading the materials for a chosen lesson. Some lessons have video content available to download, but since video files are considerably larger than the audio files, I have refrained from providing that feature in this notebook. It will, however:
- download available MP3 files related to the lesson 
- create CSV files in [Anki](https://apps.ankiweb.net/) format containing the lesson dialogue, vocabulary, expansion and grammar sections
- create a summary of the lesson in markdown format which can be easily converted to PDF (e.g., by using [Atom](https://atom.io/) with the package markdown-pdf or [Pandoc](https://pandoc.org/)).

The functions are rather simple and could be improved upon. Also note that there is essentially no error handling included here. A good web scraper should of course try to anticipate as many errors as possible.

ChinesePod provides a valuable service and if you use their materials you should have a subscription and download only the materials for the lessons you intend to study. This script was intended to automate and simplify what you can already do on their website.


### NOTES:
I have used Selenium with the Chrome browser. You will need to:
- Install [Chrome](https://www.google.com/chrome/)
- Check the version of Chrome, then get the corresponding *chromedriver* from [here](https://sites.google.com/a/chromium.org/chromedriver/downloads).
- Unzip *chromedriver*
- Put it somewhere sensible. Since I am using a Mac, I moved it to `/usr/local/bin` with:
```bash
mv chromedriver /usr/local/bin
```
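
Before starting the browser, it can help to confirm that *chromedriver* is actually reachable. The following is a minimal sketch using only the standard library; the helper name `find_driver` is an illustration and not part of the notebook.

```python
import shutil

def find_driver(name="chromedriver"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)

path = find_driver()
print(path if path else "chromedriver not found on PATH")
```

If this prints a path, Selenium should be able to locate the same executable when `webdriver.Chrome()` is called.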

 
### TODO:
- Improve exception handling. At the moment only minimal exception handling has been included.
- Understand why some utf-8 encoded Chinese characters have a different encoding. At the moment some characters are replaced by hand in process_vocab( ). Need to understand why this occurs and improve handling, so you don't need to replace characters as errors occur. Dragonmapper does not recognise these alternatively encoded Chinese characters.
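
A possible explanation for the encoding item above, offered as an assumption rather than a confirmed diagnosis: `b'\xe2\xbc\x80'` is the UTF-8 encoding of U+2F00 (KANGXI RADICAL ONE), a separate code point that looks like 一 (U+4E00). Kangxi radicals carry Unicode compatibility mappings, so NFKC normalisation should convert them to the ordinary characters. Radicals from the CJK Radicals Supplement block (such as U+2EE2, the simplified horse radical) may have no such mapping, so the sketch below keeps a small manual table as a fallback. The names `MANUAL_MAP` and `normalise_radicals` are hypothetical. Note that NFKC also rewrites other compatibility characters, for example full-width punctuation, which may or may not be wanted here.

```python
import unicodedata

# Fallback for radicals that NFKC leaves untouched (assumed list; extend as needed).
MANUAL_MAP = {
    '\u2ee2': '\u9a6c',  # CJK RADICAL C-SIMPLIFIED HORSE -> U+9A6C (horse)
}

def normalise_radicals(text):
    """Map radical code points to the ordinary Chinese characters."""
    # NFKC applies the compatibility mappings defined for the Kangxi radicals.
    text = unicodedata.normalize('NFKC', text)
    # Anything still unmapped is handled by the manual table.
    return ''.join(MANUAL_MAP.get(ch, ch) for ch in text)

print(normalise_radicals('\u2f00').encode('utf-8'))
```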

## Required Packages

In [1]:
from selenium import webdriver # Web scraping
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup # Web scraping
import re # regular expressions
import urllib.parse # convert redirect url, e.g., %2F --> /
import os # path handling
import time # force a wait period
import csv # writing a csv file
import wget # downloading files # install with pip
import numpy as np 
import pandas as pd 
from urllib.error import HTTPError, URLError
import dragonmapper.transcriptions # for detecting pinyin # install with pip
import dragonmapper.hanzi # for detecting hanzi

In [2]:
delay = 20 # In seconds

## Local Folder for Storing Lesson Materials (INPUT REQUIRED)

In [3]:
local_CP_dir = './data'

## ChinesePod URLs 

In [4]:
# Homepage 
baseURL = 'https://chinesepod.com'
# Login page 
login_url = "https://chinesepod.com/accounts/signin"

## Start browser

In [5]:
driver = webdriver.Chrome()

## Login to ChinesePod website (INPUT REQUIRED)

In [6]:

def cp_login(my_email,my_password):
    """
    
    Function to log into ChinesePod.
    
    """
    
    # Load login page
    driver.get(login_url)
    
    # Wait for the login button to be clickable
    loginBtn = WebDriverWait(driver,delay).until(EC.element_to_be_clickable((By.CLASS_NAME,'btn')))
    
    # Now log in
    driver.find_element(By.NAME,"email").send_keys(my_email)
    driver.find_element(By.NAME,"password").send_keys(my_password)
    loginBtn.click()
    

In [7]:
# Login to ChinesePod (INPUT REQUIRED)
cp_login("youAre@thisEmail.com","yourPassword") 

## Functions to download data, files and create Anki decks

In [8]:
def make_anki_csv(Chinese,Pinyin,English,AudiofNames,tag,folder,fName):
    """
    Create a csv file with ChinesePod vocabulary.
    
    """
    if Chinese != []:
    
        with open(os.path.join(folder,fName), mode='w', newline='', encoding='utf-8') as csv_file:
        
            csv_writer = csv.writer(csv_file, delimiter=',')
            numOfEntries = min([len(Chinese),len(Pinyin),len(English)]) #len(Chinese)
        
            for i in range(numOfEntries):
                
                if AudiofNames == [] or i>= len(AudiofNames):
                
                    csv_writer.writerow([Chinese[i], Pinyin[i], English[i],'NO-FILE',tag])
        
                else: 
                    
                    csv_writer.writerow([Chinese[i], Pinyin[i], English[i],'[sound:'+ AudiofNames[i]+']',tag])
    else:
        print('No vocabulary passed')
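
To make the CSV layout concrete, here is a small sketch of the rows that `make_anki_csv` is meant to produce: Chinese, pinyin, English, an Anki `[sound:...]` field (or `NO-FILE`), and a tag. It writes to an in-memory buffer instead of a file; the helper `anki_row` and the sample vocabulary are illustrative assumptions.

```python
import csv
import io

def anki_row(chinese, pinyin, english, audio_name, tag):
    """Build one row in the same column order as make_anki_csv."""
    sound = '[sound:' + audio_name + ']' if audio_name else 'NO-FILE'
    return [chinese, pinyin, english, sound, tag]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',')
writer.writerow(anki_row('你好', 'nǐ hǎo', 'hello', 'hello.mp3', 'dialogue'))
writer.writerow(anki_row('谢谢', 'xièxie', 'thanks', None, 'dialogue'))
print(buffer.getvalue())
```

For the audio to play in Anki, the referenced MP3 files must also be copied into Anki's media folder.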

In [9]:
# I will ignore comments for now
def get_comments(driver,folder):
    """Ignoring comments."""
    
    print('...... Skipping Comments Tab ......\n')
    

In [10]:
def process_vocab(vocab):
    """ Converts a list of Chinese, pinyin and English strings (vocab) into separate lists.
    It is expected that the list has Chinese, Pinyin, English and then this pattern is 
    repeated throughout the list. However, in some cases we may have something missing
    and this is handled by inserting '-' for the missing string."""
    
    Chinese = []
    Pinyin = []
    English = []
    
    # Used to offset the index (idx below) correctly when something is missing
    offset = 0 
    
    for i,entry in enumerate(vocab):
        
        # Remove the Chinese punctuation symbols. This is sometimes used in the pinyin, which
        # complicates checking whether we have Chinese, Pinyin or English. Numerals also complicate
        # pinyin testing
        test = re.sub('[·！￥……（）——、】【」「’“”；：？、。》，《1234567890。]',' ',entry)
        
        # Escape character
        test = test.replace('\\','')
        
        # There is an issue with the character they use for yi (one), among others.
        # '⼀'.encode('utf-8') returns b'\xe2\xbc\x80' rather than b'\xe4\xb8\x80',
        # which dragonmapper does not recognise. The same applies to the characters replaced below.
        test = test.replace('⼀','一')
        test = test.replace('⼋','八')
        test = test.replace('⼗','十')
        test = test.replace('⻢','马')
        test = test.replace('⼭','山')
        test = test.replace('⽺','羊')
   
        # Index must be offset if something is missing
        idx = i+offset
        
        if idx%3 == 0:
            
            # Should be Chinese unless something was missing
            if dragonmapper.hanzi.has_chinese(test) or test.replace(' ','') == '':
                Chinese.append(entry)
            else:
                # Chinese is missing
                Chinese.append('-')
                
                # Do we have pinyin or English?
                if dragonmapper.transcriptions.is_pinyin_compatible(test):
                    # Have pinyin
                    Pinyin.append(entry)
                    offset += 1
                else:
                    # Have English => Also missing pinyin
                    Pinyin.append('-')
                    English.append(entry)
                    offset += 2
                                
        elif idx%3 == 1:
            
            # Should be pinyin unless something was missing
            if dragonmapper.transcriptions.is_pinyin_compatible(test) or test.replace(' ','') == '':
                # It is possible that a string is actually english but is also valid pinyin,
                # e.g., 'language'. In this case we assume it is pinyin. 
                # We use is_pinyin_compatible() because is_pinyin() returns false if it also contains 
                # English, e.g., An English name in a pinyin sentence.
                Pinyin.append(entry)
            else:
                # Pinyin is missing
                Pinyin.append('-')
                
                # Do we have Chinese or English?
                if dragonmapper.hanzi.has_chinese(test):
                    # Have Chinese => Also English missing
                    Chinese.append(entry)
                    English.append('-')
                    offset += 2
                else:
                    # Have English
                    English.append(entry)
                    offset += 1
                    
        elif idx%3 == 2:
            # Should be English unless something was missing
            if not dragonmapper.hanzi.has_chinese(test):
                # It is possible that a string is actually english but is also valid pinyin,
                # e.g., 'language'. In this case we assume it is English. If it is also 
                # valid pinyin we print a warning
                if dragonmapper.transcriptions.is_pinyin(test):
                    print('Expected English word is also valid pinyin')
                    print('Word: ',test)
                English.append(entry)
            else:
                # English was missing
                English.append('-')
                
                # Have Chinese
                Chinese.append(entry)
                offset += 1

    return Chinese,Pinyin,English


In [11]:
def get_dialogue(driver,folder):
    """Has to use driver because the tabs are dynamic."""
        
    print('...... Getting Dialogue Tab ......\n')
    
    
    # Wait for the dialogue tab to be clickable
    dialogueBtn = WebDriverWait(driver,delay).until(EC.element_to_be_clickable((By.CLASS_NAME,'dialogueTab')))
    #driver.find_element_by_class_name("dialogueTab").click()
    dialogueBtn.click()
    
    # Need to wait for the tab to load otherwise an error will occur
    #time.sleep(delay)
    dialogue = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.ID,'lesson-dialogue')))

    # Make the soup once page is ready
    dialogue_bs = BeautifulSoup(driver.page_source,'html')
    
    # Get dialogue Chinese, pinyin and English
    temp = dialogue_bs.find('div',{'id':'dialogue'}).find_all('p')
    
    # Get text and strip it
    vocab = []
    for entry in temp:
        vocab.append(entry.text.strip())

    # Separates the list of Chinese, pinyin and English strings into separate lists
    [Chinese,Pinyin,English] = process_vocab(vocab)
    
    # Get the audio mp3 urls
    d = dialogue_bs.find_all(href=re.compile(r"\.mp3"))
    
    urls = []
    filenames = []
        
    # Only want the dialogue mp3 urls which are redirected
    redirect = '/redirect/?url='

    for item in d:
    
        temp_url = item['href']

        if redirect in temp_url:
            
            url = urllib.parse.unquote(temp_url.replace(redirect,''))
            
            # Sometimes the file names are too long!
            # Note that wget can have issues downloading files with long names, because
            # it adds extra characters to the temporary file's name during the download.
            # These extra characters can push the name over the file system's file name
            # limit, which in some cases is 255 characters.
            fname = url.split('/')[-1]
            forig = fname
            isLongName = False
            if len(fname)>=200:
                isLongName = True
                ftype = fname.split('.')[-1]
                print('WARNING: File name was too long, so we have shortened it.')
                print('Originally:')
                print('Length:',len(fname))
                print(fname)
                fname = 'long_name_' + fname[:200] + '.' + ftype
                print('New name:')
                print('Length:',len(fname))
                print(fname)
                
            # Sometimes their audio links are incomplete 
            try:
            
                # Download mp3's
                wget.download(url,out=os.path.join(folder,fname))

                if isLongName and len(forig)<=255:
                    print('The file name was long, but no longer than 255 characters.')
                    print('The file will be renamed to its original name.')
                    os.rename(os.path.join(folder,fname),os.path.join(folder,forig))
                    fname = forig

                filenames.append(fname)
                urls.append(url)
            
            except HTTPError:
            
                urls.append('MISSING_URL')
                filenames.append('MISSING_FILE')
            
    return Chinese,Pinyin,English,urls,filenames
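
The redirect handling in `get_dialogue` can be illustrated in isolation. This sketch strips the `/redirect/?url=` prefix and percent-decodes the remainder, as the function does; the helper name `decode_redirect` and the example link are made up for illustration.

```python
import urllib.parse

REDIRECT_PREFIX = '/redirect/?url='

def decode_redirect(href):
    """Return the real audio URL hidden in a redirect link, or None."""
    if REDIRECT_PREFIX not in href:
        return None
    # Drop the prefix, then turn escapes such as %2F back into '/'.
    return urllib.parse.unquote(href.replace(REDIRECT_PREFIX, ''))

example = '/redirect/?url=https%3A%2F%2Fexample.com%2Faudio%2Fline_01.mp3'
print(decode_redirect(example))
```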

In [12]:
#[Ch,Pin,Eng,URLS,fNames] = get_dialogue(driver,local_CP_dir)

In [13]:
#print(len(Ch),Ch)
#print(len(Pin),Pin)
#print(len(Eng),Eng)
#print(len(URLS),URLS)
#print(len(fNames),fNames)

In [14]:
def get_vocabulary(driver,folder):
    
    print('...... Getting Vocabulary Tab ......\n')
    
    driver.find_element(By.CLASS_NAME,"vocabularyTab").click()

    # Need to wait for the tab to load otherwise an error will occur
    #time.sleep(delay)
    vocabulary = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.ID,'lesson-vocabulary')))
    
    # Make the soup once page is ready
    vocab_bs = BeautifulSoup(driver.page_source,'html')
    
    # Get vocabulary Chinese, pinyin and English
    temp = vocab_bs.find_all(width="30%")
    
    # Get text and strip it
    vocab = []
    for entry in temp:
        vocab.append(entry.text.strip())

    # Separates the list of Chinese, pinyin and English strings into separate lists
    [Chinese,Pinyin,English] = process_vocab(vocab)
    
    # Get the audio mp3 files
    d = vocab_bs.find('div',{'id':'lesson-vocabulary'})
    d = d.find_all(src=re.compile(r'\.mp3'))
    
    urls = []
    filenames = []
    
    for item in d:
        
        url = item['src']
        
        # Sometimes the file names are too long!
        # Note that wget can have issues downloading files with long names, because
        # it adds extra characters to the temporary file's name during the download.
        # These extra characters can push the name over the file system's file name
        # limit, which in some cases is 255 characters.
        fname = url.split('/')[-1]
        forig = fname
        isLongName = False
        if len(fname)>=200:
            isLongName = True
            ftype = fname.split('.')[-1]
            print('WARNING: File name was too long, so we have shortened it.')
            print('Originally:')
            print('Length:',len(fname))
            print(fname)
            fname = 'long_name_' + fname[:200] + '.' + ftype
            print('New name:')
            print('Length:',len(fname))
            print(fname)
            

        # Sometimes their audio links are incomplete
        try:
            
            # Download mp3's
            wget.download(url,out=os.path.join(folder,fname))

            if isLongName and len(forig)<=255:
                print('The file name was long, but no longer than 255 characters.')
                print('The file will be renamed to its original name.')
                os.rename(os.path.join(folder,fname),os.path.join(folder,forig))
                fname = forig

            filenames.append(fname)
            urls.append(url)
            
        except HTTPError:
            
            urls.append('MISSING_URL')
            filenames.append('MISSING_FILE')
            
    return Chinese,Pinyin,English,urls,filenames
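
The file name shortening logic is repeated in `get_dialogue`, `get_vocabulary`, `get_expansion` and `get_grammar`. One way to factor it out is sketched below. The helper name `safe_audio_name` is hypothetical, and it uses `os.path.splitext` rather than the notebook's `split('.')`, so treat it as an approximation of the existing behaviour rather than a drop-in replacement.

```python
import os

def safe_audio_name(url, max_len=200):
    """Return a file name for `url`, shortened if it is max_len characters or longer."""
    name = url.split('/')[-1]
    if len(name) < max_len:
        return name
    # Keep the extension so the file is still recognised as audio.
    stem, ext = os.path.splitext(name)
    return 'long_name_' + stem[:max_len] + ext

print(safe_audio_name('https://example.com/audio/short.mp3'))
print(len(safe_audio_name('https://example.com/audio/' + 'a' * 300 + '.mp3')))
```

Each download loop could then call this helper once instead of repeating the length check and the warning messages.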
 

In [15]:
#[Ch,Pin,Eng,URLS,fNames] = get_vocabulary(driver,local_CP_dir)

In [16]:
#print(Ch)
#print(Pin)
#print(Eng)
#print(URLS)
#print(fNames)

In [17]:
def get_expansion(driver,folder):
    
    print('...... Getting Expansion Tab ......\n')
    
    driver.find_element(By.CLASS_NAME,"expansionTab").click()
    
    # Need to wait for the tab to load otherwise an error will occur
    #time.sleep(delay)
    expansion = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.ID,'expansion')))
    
    # Make the soup once page is ready  
    exp_bs = BeautifulSoup(driver.page_source,'html')
    
    temp = exp_bs.find('div',{'id':'expansion'}).find_all('p')
    
    # Get text and strip it
    vocab = []
    for entry in temp:
        vocab.append(entry.text.strip())

    # Separates the list of Chinese, pinyin and English strings into separate lists
    [Chinese,Pinyin,English] = process_vocab(vocab)
    
    # Get the audio mp3 files
    d = exp_bs.find('div',{'id':'expansion'}).find_all(src=re.compile(r'\.mp3'))
        
    urls = []
    filenames = []
    
    for item in d:
        
        url = item['src']
        
        # Sometimes the file names are too long!
        # Note that wget can have issues downloading files with long names, because
        # it adds extra characters to the temporary file's name during the download.
        # These extra characters can push the name over the file system's file name
        # limit, which in some cases is 255 characters.
        fname = url.split('/')[-1]
        forig = fname
        isLongName = False
        if len(fname)>=200:
            isLongName = True
            ftype = fname.split('.')[-1]
            print('WARNING: File name was too long, so we have shortened it.')
            print('Originally:')
            print('Length:',len(fname))
            print(fname)
            fname = 'long_name_' + fname[:200] + '.' + ftype
            print('New name:')
            print('Length:',len(fname))
            print(fname)
        
        # Sometimes their audio links are incomplete
        try:
        
            # Download mp3's
            wget.download(url,out=os.path.join(folder,fname))

            if isLongName and len(forig)<=255:
                print('The file name was long, but no longer than 255 characters.')
                print('The file will be renamed to its original name.')
                os.rename(os.path.join(folder,fname),os.path.join(folder,forig))
                fname = forig

            filenames.append(fname)
            urls.append(url)
            
        except HTTPError:
            
            urls.append('MISSING_URL')
            filenames.append('MISSING_FILE')
            
    return Chinese,Pinyin,English,urls,filenames
    

In [18]:
#[Ch,Pin,Eng,URLS,fNames] = get_expansion(driver,local_CP_dir)

In [19]:
#print(len(Ch),Ch)
#print(len(Pin),Pin)
#print(len(Eng),Eng)
#print(len(URLS),URLS)
#print(len(fNames),fNames)

In [20]:
def get_grammar(driver,folder):
    
    print('...... Getting Grammar Tab ......\n')
    
    driver.find_element(By.CLASS_NAME,"grammarTab").click()

    # Need to wait for the tab to load otherwise an error will occur
    #time.sleep(delay)
    grammar = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.ID,'grammar_introduction')))
    
    # Make the soup once page is ready  
    gram_bs = BeautifulSoup(driver.page_source,'html')
    
    # All grammar
    d = gram_bs.find_all('div',{'id':'grammar_introduction'})

    # Now find title, text and example sentences with audio
    Chinese = []
    Pinyin = []
    English = []
    urls = []
    filenames = []
    grammar_text = ''
    
    for item in d:
        
        # Get title
        title = item.find('div',{'class':'panel-title'}).h3.get_text().strip()
        #print(title)

        # Get grammar explanation
        #explanation = item.find('p',{'class':'font-english pad15'}).get_text().strip()
        explanation = item.find('div',{'class':'panel-body xs-pad0'})
        explanation = explanation.find_all('p')
        exp_text = ''
        for p in explanation:
            exp_text += p.get_text().strip() + '\n'
        explanation = exp_text    
        
        #print(explanation)

        grammar_text += title + '\n' + explanation + '\n\n'
        
        # Get example sentences
        try: 
            examples = item.find_all('div',{'id':'grammar_sentence'})
            #examples = item.find('tbody').find_all('tr')

            for item in examples:
            
                text = item.find('tbody').find_all('p')
                
                # Get text and strip it
                vocab = []
                for entry in text:
                    vocab.append(entry.text.strip())

                # Separates the list of Chinese, pinyin and English strings into separate lists
                [Ch,Pin,Eng] = process_vocab(vocab)
                Chinese += Ch
                Pinyin += Pin
                English += Eng
              
                # Get the audio mp3 files
                #d = item.find('td',{'class':'dialogue-audio-cell'}).find('audio')
                d = item.find_all(href=re.compile(r"\.mp3"))
                
                # Only want the dialogue mp3 urls which are redirected
                redirect = '/redirect/?url='

                for item in d:
    
                    temp_url = item['href']

                    if redirect in temp_url:
            
                        url = urllib.parse.unquote(temp_url.replace(redirect,''))  
                
                        # Sometimes the file names are too long!
                        # Note that wget can have issues downloading files with long names, because
                        # it adds extra characters to the temporary file's name during the download.
                        # These extra characters can push the name over the file system's file name
                        # limit, which in some cases is 255 characters.
                        fname = url.split('/')[-1]
                        forig = fname
                        isLongName = False
                        if len(fname)>=200:
                            isLongName = True
                            ftype = fname.split('.')[-1]
                            print('WARNING: File name was too long, so we have shortened it.')
                            print('Originally:')
                            print('Length:',len(fname))
                            print(fname)
                            fname = 'long_name_' + fname[:200] + '.' + ftype
                            print('New name:')
                            print('Length:',len(fname))
                            print(fname)
                
                        # Sometimes their audio links are incomplete
                        try:
                
                             
                            # Download mp3's
                            wget.download(url,out=os.path.join(folder,fname))

                            if isLongName and len(forig)<=255:
                                print('The file name was long, but no longer than 255 characters.')
                                print('The file will be renamed to its original name.')
                                os.rename(os.path.join(folder,fname),os.path.join(folder,forig))
                                fname = forig

                            filenames.append(fname)
                            urls.append(url)
            
                    
                        except HTTPError:
            
                            urls.append('MISSING_URL')
                            filenames.append('MISSING_FILE')
                
        except Exception:
            print('No examples found.')
            
        # Removes examples and formats GText nicely    
        blacklist = Chinese + Pinyin + English + ['?','Sample Sentences']#,'(  )+','(\n\n)+']
        temp = grammar_text
        for item in blacklist:
            temp = re.sub(re.escape(item),'',temp)
            
        temp = re.sub('(  )+','',temp)
        temp = re.sub('(\n\n)+','',temp)
        
        grammar_text = temp
        
    return [Chinese,Pinyin,English,grammar_text,urls,filenames]

In [21]:
#[Ch,Pin,Eng,GText,URLS,fNames] = get_grammar(driver,local_CP_dir)

In [22]:
#print(len(Ch),Ch)
#print(len(Pin),Pin)
#print(len(Eng),Eng)
#print(GText)
#print(len(URLS),URLS)
#print(len(fNames),fNames)

In [23]:
def get_practice(driver,folder):
    
    print('...... Getting Practice Tab ......\n')
    
    driver.find_element(By.CLASS_NAME,"practiceTab").click()
    
    # Need to wait for the tab to load otherwise an error will occur
    #time.sleep(delay)
    practice = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.ID,'practice')))
    
    # Make the soup once page is ready  
    prac_bs = BeautifulSoup(driver.page_source,'html')
    
    d = prac_bs.find('div',{'id':'practice'})#.find_all('h1')
    d = d.find_all(href=re.compile("chinesepod.com"))
    
    Text = '**Practice**\n\n'
    
    Text += 'Also see the following:\n\n'
    
    for item in d:

        if item is not None:
            
            link = item['href']

            if 'tab' in link and '#' in link:
                link = link.split('#')[0]
            
            name = item.get_text().strip()
        
            Text += '[' + name + '](' + link + ')\n'
    
    return Text
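
The link formatting in `get_practice` can be sketched on its own with the standard library. `urldefrag` removes a trailing `#fragment`, which is a slightly broader rule than the tab check in the function (it strips every fragment). The helper name `markdown_link` and the example lesson URL are assumptions for illustration.

```python
from urllib.parse import urldefrag

def markdown_link(name, link):
    """Format a markdown link, dropping any '#fragment' from the URL."""
    clean_link, _fragment = urldefrag(link)
    return '[' + name.strip() + '](' + clean_link + ')'

print(markdown_link(' Ordering Coffee ',
                    'https://chinesepod.com/lessons/ordering-coffee#dialogue-tab'))
```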
    

In [24]:
#TEXT = get_practice(driver,local_CP_dir)

In [25]:
#print(TEXT)

In [26]:
def get_main_mp3_files(lesson_url,driver,folder):
    
    """ This function gets the names, urls and filenames to main mp3 files in addition
    to downloading them. There are typically 3 mp3 files, namely the lesson, the 
    dialogue and the vocabulary, but this can vary from lesson to lesson."""
    
    print('...... Getting Main MP3 Files ......\n')
    
    # Get the html
    driver.get(lesson_url)
    
    #time.sleep(delay)
    playPause = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.CLASS_NAME,'play-pause-control')))
    
    bs = BeautifulSoup(driver.page_source,'html')
    
    list_group = bs.find_all(href=re.compile("chinesepod_"))
    
    names = []
    filenames = []
    urls = []
    
    for item in list_group:
        
        # URL
        url = item['href']
        
        fType = url.split('.')[-1]
        
        if fType == 'mp3':
        
            # Name of the audio, e.g., lesson, dialogue or vocabulary
            try:
                name = item['data-type']
                names.append(name)
            except KeyError:
                name = (item.text).strip()
                
                print(item)
                print(name)
                names.append(name)
                                
            # Sometimes their audio links are incomplete
            try:
                
                #TODO: Does not yet check file name length, but these files generally do not have really long names
                
                # Download mp3's
                filename = wget.download(url,out=folder)
                filename = os.path.basename(filename)
                filenames.append(filename)
                urls.append(url)
            
            except HTTPError:
            
                urls.append('MISSING_URL')
                filenames.append('MISSING_FILE')
                
            except (TimeoutError,URLError) as e:
                
                print('NOTE: There may be additional audio links in the description which are ignored.')
                urls.append('MISSING_URL')
                filenames.append('MISSING_FILE')
                
        elif fType == 'pdf':
            # Download the pdf
            filename = wget.download(url,out=folder)
            print('*** There was a PDF and it was downloaded ***')
            
    return names,urls,filenames
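
`get_main_mp3_files` decides between MP3 and PDF with `url.split('.')[-1]`, which would give an unexpected result if a link ever carried a query string. A slightly more defensive check is sketched below; the helper name `url_extension` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse

def url_extension(url):
    """Return the lower-case file extension of a URL's path, without the dot."""
    path = urlparse(url).path  # ignores any '?query' or '#fragment'
    return os.path.splitext(path)[1].lstrip('.').lower()

print(url_extension('https://example.com/media/chinesepod_X0001.mp3?download=1'))
print(url_extension('https://example.com/notes.PDF'))
```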

In [27]:
#[NAMES,URLS,FNAMES] = get_main_mp3_files(lesson_url,driver,local_CP_dir)

In [28]:
#print(NAMES)
#print(URLS)
#print(FNAMES)

In [29]:
def get_navigation_links(bs):
    
    
    nav = bs.find('ul',{'class':'nav nav-tabs menu lesson-menu-tabs'}).find_all('li')
           
    nav_page_names = []
    nav_page_urls = []
    
    for item in nav:
        
        # Get the name of each tab 
        name = item.text.strip()
        
        # Ignore Exercises
        if name != 'Exercises' :
            
            nav_page_names.append(name)
            url = item.a['href']
            nav_page_urls.append(url)

    return nav_page_names, nav_page_urls

In [30]:
#[nav_names,nav_urls] = get_navigation_links(bs)
#print(nav_names)
#print(nav_urls)

In [31]:
def get_nav_materials(url,driver,folder):
    
    
    driver.get(url)
    
    #time.sleep(delay)
    playPause = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.CLASS_NAME,'play-pause-control')))
    
    bs = BeautifulSoup(driver.page_source,'html')
      
    names,urls = get_navigation_links(bs)
    
    numOfNames = len(names)
    
    lesson_text = ''
    
    for i in range(numOfNames):
        
        if names[i] == 'Comments':
            
            get_comments(driver,folder)
            
        elif names[i] == 'Dialogue':
            
            [Ch,Pin,Eng,URLS,fNames] = get_dialogue(driver,folder)
            
            make_anki_csv(Ch,Pin,Eng,fNames,'dialogue',folder,'dialogue.csv')
            
            print(Ch)
            print(Pin)
            print(Eng)
            print(URLS)
            print(fNames)
            
        elif names[i] == 'Vocabulary':
            
            [Ch,Pin,Eng,URLS,fNames] = get_vocabulary(driver,folder)
            
            make_anki_csv(Ch,Pin,Eng,fNames,'vocabulary',folder,'vocabulary.csv')
            
            print(Ch)
            print(Pin)
            print(Eng)
            print(URLS)
            print(fNames)
            
        elif names[i] == 'Expansion':
            
            [Ch,Pin,Eng,URLS,fNames] = get_expansion(driver,folder)
            
            make_anki_csv(Ch,Pin,Eng,fNames,'expansion',folder,'expansion.csv')
            
            print(Ch)
            print(Pin)
            print(Eng)
            print(URLS)
            print(fNames)
            
        elif names[i] == 'Grammar':
            
            [Ch,Pin,Eng,GText,URLS,fNames] = get_grammar(driver,folder)
            
            make_anki_csv(Ch,Pin,Eng,fNames,'grammar',folder,'grammar.csv')
            
            print(Ch)
            print(Pin)
            print(Eng)
            print(GText)
            print(URLS)
            print(fNames) 
                        
            lesson_text += GText + '\n'
        
        elif names[i] == 'Practice':
            
            TEXT = get_practice(driver,folder)
            
            lesson_text += TEXT + '\n'
        
        else:
            
            print('************ UNKNOWN TAB IN NAVIGATION BAR ************')
        
    
    return names,urls,lesson_text

In [32]:
#[NAV_NAMES,NAV_URLS,NAV_TEXT] = get_nav_materials(lesson_url,driver,local_CP_dir)

In [33]:
#print(NAV_TEXT)

In [34]:
def extract_name(bs):
    """
        Extracts the title of the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs.
        
    """
    
    temp = bs.find('h1',{'class':'lesson-page-title'})
    name = temp.find('span',{'itemprop':'name'}).text.strip()
    name = name.replace('/','-') # some have slashes, slashes cause issues with file names
    
    return name

In [35]:
def extract_level(bs):
    """
        Extracts the level of the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs.
        
    """
    
    temp = bs.find('h1',{'class':'lesson-page-title'})
    level = temp.small.text.strip()
    
    return level

In [36]:
def extract_teachers(bs):
    """
        Extracts the teachers of the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs.
    """
    
    Teachers = bs.find('td',{'itemprop':'name'}).text.strip()
    Teachers = Teachers.replace(u'\xa0', u'').split(',')
    
    return Teachers

In [37]:
def extract_key(bs):
    """
        Extracts lesson key of the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs.
        
    """
    
    key = bs.find('span',{'class':'fa fa-key'}).parent.next_sibling.next_sibling.text
    
    return key

In [38]:
def is_video_lesson(bs):
    """
        Determines if the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs is a video lesson.
        
    """
    # ChinesePod uses wistia for hosting the videos
    match = bs.find(src=re.compile('wistia.com'))
    
    tags = extract_tags(bs)
    
    if match or 'video' in tags or 'Video' in tags:
        bIs_video_lesson = True
    else:
        bIs_video_lesson = False
    
    return bIs_video_lesson

In [39]:
def extract_date(bs):
    """
        Extracts the date that the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs was published.
    
    """
    
    datePublished = bs.find('span',{'class':'fa fa-calendar'}).parent.next_sibling.next_sibling.text
    
    return datePublished
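
`extract_date` returns the date as display text, for example `May 19, 2017`. If a sortable form is wanted, the text can be parsed with the standard library. This is a sketch only: `to_iso_date` is a hypothetical helper, and it assumes the page always uses English month names in this layout:

```python
from datetime import datetime

def to_iso_date(date_text):
    """Convert text such as 'May 19, 2017' to '2017-05-19'.

    Hypothetical helper; assumes English month names, which strptime
    only parses under an English (or C) locale.
    """
    return datetime.strptime(date_text.strip(), '%B %d, %Y').date().isoformat()

print(to_iso_date('May 19, 2017'))
print(to_iso_date('January 8, 2020'))
```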

In [40]:
def extract_tags(bs):
    """
        Extracts lesson tags if they are present and returns [''] if they are not
        for the ChinesePod lesson at a url which has been put into the BeautifulSoup
        object bs
    
    """
    
    try:
        tags = bs.find('span',{'class':'fa fa-lightbulb-o'}).parent.next_sibling.next_sibling.text
        tags = tags.replace('\n','').split(',')
        tags = list(map(str.strip,tags))
    except AttributeError:
        tags = ['']
        
    return tags

In [41]:
def extract_functions(bs):
    """
        Extracts lesson functions if they are present and returns [''] if they are not.
        
    """
    try:
        functions = bs.find('span',{'class':'fa fa-wrench'}).parent.next_sibling.next_sibling.text
        functions = functions.replace('\n','').split(',') # could be more than one?
        functions =  list(map(str.strip,functions))
    except AttributeError:
        functions = ['']
    
    return functions
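
`extract_tags` and `extract_functions` differ only in the icon class they search for; the text clean-up is the same in both. The sketch below shows that shared step as a hypothetical helper, `split_comma_list` (an assumption, not in the notebook). It works on the raw text alone, so it can be tried without a browser:

```python
def split_comma_list(raw_text):
    """Split comma-separated text into stripped items.

    Hypothetical helper: mirrors the clean-up in extract_tags and
    extract_functions, and returns [''] for empty input.
    """
    items = [item.strip() for item in raw_text.replace('\n', '').split(',')]
    return items if any(items) else ['']

print(split_comma_list('\n video, classroom ,grammar\n'))
```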

In [42]:
def extract_img_url(bs):
    """
        Extracts image url of the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs
    
    """
    
    img_url = bs.find_all('img',{'id':'side-img'})[0]
    img_url = img_url['data-src']
    
    return img_url

In [43]:
def extract_descript(bs):
    """
        Extracts the description of the ChinesePod lesson at a url which has been put 
        into the BeautifulSoup object bs.
    
    """
    
    text = bs.find('div',{'class':'lesson-video-content'}).find_all(text=True)
    
    # If a text element contains any of the black listed items it will not be included in the description
    blackList = ['wistia','unpin video','@']
    
    description = ''
    for t in text:
        t_include = True
        for blItem in blackList:
            if blItem in t:
                t_include = False
        if t_include:       
            description = description + t
            
    # Remove leading and trailing whitespace (including \n)
    description = description.strip()  
    
    return description
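
The nested loop in `extract_descript` keeps a text fragment only if it contains none of the blacklisted substrings. The same filter can be written with `any()`. This sketch works on plain strings; `filter_fragments` and the sample fragments are illustrative assumptions, not data from the site:

```python
def filter_fragments(fragments, black_list):
    """Join the fragments that contain none of the black-listed substrings."""
    kept = [t for t in fragments
            if not any(item in t for item in black_list)]
    return ''.join(kept).strip()

# Made-up fragments, for illustration only
fragments = ['\n', 'wistia embed code', 'A short description.', 'unpin video', '\n']
print(filter_fragments(fragments, ['wistia', 'unpin video', '@']))
```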

In [44]:
def get_lesson_info(lesson_url,driver,verbose=False):
    """
        Extracts information about the ChinesePod lesson at the url 
        lesson_url and returns it in a dictionary.
    
    """
    
    # Get the html
    driver.get(lesson_url)
    
    #time.sleep(delay)
    fakey = WebDriverWait(driver,delay).until(
        EC.presence_of_element_located((By.CLASS_NAME,'fa-key')))
    
    bs = BeautifulSoup(driver.page_source,'html')

    # Name/Title of the lesson
    lesson_name = extract_name(bs)    
    
    # Level of the lesson
    lesson_level = extract_level(bs)
    
    # Teachers 
    Teachers = extract_teachers(bs)
    
    # Key
    key = extract_key(bs)
    
    # Is it a video lesson?
    video =  is_video_lesson(bs)
    
    # Date published
    datePublished = extract_date(bs)
    
    # Tags
    tags = extract_tags(bs)
    
    # Functions (basically another tag??)
    functions = extract_functions(bs)
    
    # Lesson image url
    img_url = extract_img_url(bs)
    
    # Lesson description
    description = extract_descript(bs)
     
    # OUTPUT:
    #########
    
    # Screen:
    if verbose:
        print('Name: ',lesson_name,'\n')
        print('Lesson url: ',lesson_url,'\n')
        print('Level: ',lesson_level,'\n')
        print('Teachers: ',' | '.join(Teachers),'\n')
        print('Key: ',key,'\n')
        print('Video: ', video,'\n')
        print('Date published: ',datePublished,'\n')
        print('Tags: ',' | '.join(tags),'\n')
        print('Functions: ',' | '.join(functions),'\n')
        print('Lesson image url: \n',img_url,'\n')
        print('Description: \n',description,'\n')
    else:
        print(key + '\t' + lesson_level + '\t' + lesson_name)
    
    # Return everything as a dictionary
    lesson_data = {'Key': key,
                   'Name': lesson_name,
                   'Level': lesson_level,
                   'URL': lesson_url,
                   'Teachers':Teachers,
                   'Date': datePublished,
                   'Video': video,
                   'Tags': tags,
                   'Functions': functions,
                   'ImageURL': img_url,
                   'Description': description }
    
    return lesson_data

In [45]:
#lesson_dic = get_lesson_info(lesson_url,driver,verbose=True)
#print(lesson_dic)

In [46]:
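def lesson_summary(lesson_dic,folder,CSV_FILES_NAMES,NAV_TEXT):
    """
        Writes a markdown summary of the lesson into folder. The summary contains
        the lesson information in lesson_dic, one table for each CSV file listed
        in CSV_FILES_NAMES and finally the text in NAV_TEXT.
        
    """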
    
    print('...... Making Markdown Summary ......\n')
    
    lesson_name = lesson_dic['Name']
    lesson_url = lesson_dic['URL']
    lesson_level = lesson_dic['Level']
    Teachers = lesson_dic['Teachers']
    key = lesson_dic['Key']
    video = lesson_dic['Video']
    datePublished = lesson_dic['Date']
    tags = lesson_dic['Tags']
    functions = lesson_dic['Functions']
    img_url = lesson_dic['ImageURL']
    description = lesson_dic['Description']

    # Video or only audio
    if video:
        lesson_type = 'Video'
    else:
        lesson_type = 'Audio_Only'
    
    # Write lesson summary in a markdown file
    f = open(os.path.join(folder,lesson_level+'-'+lesson_type+'-'+key+'-'+lesson_name+'-Lesson-Summary.md'),'w+')
    f.write('![Image](' + img_url + ')'+'\n')
    f.write('\n')
    f.write('**Name:** '+lesson_name)
    f.write('\n')
    f.write('**Lesson url:** '+lesson_url)
    f.write('\n')
    f.write('**Level:** '+lesson_level)
    f.write('\n')
    f.write('**Teachers:** '+' | '.join(Teachers))
    f.write('\n')
    f.write('**Key:** '+key)
    f.write('\n')
    f.write('**Video:** '+ str(video))
    f.write('\n')
    f.write('**Date published:** '+datePublished)
    f.write('\n')
    f.write('**Tags:** '+' | '.join(tags))
    f.write('\n')
    f.write('**Functions:** '+' | '.join(functions)+'\n')
    f.write('\n')
    f.write('**Description:** \n\n'+description+'\n')
    f.write('\n')
        
    for file in CSV_FILES_NAMES:
        
        fileName = file

        with open(os.path.join(folder,fileName)) as csv_file:
    
            csv_reader = csv.reader(csv_file, delimiter=',')
            Ch = []
            Pin = []
            Eng = []

            for row in csv_reader:
                
                [C,P,E,A,t] = row
                Ch.append(C)
                Pin.append(P)
                Eng.append(E)
    
        numOfLines = len(Ch)
    
        Header = file.split('.')[0].capitalize()
        f.write('**'+Header+'**\n\n')
        f.write('| Chinese' + '\t | \t' + 'Pinyin' + '\t | \t' + 'English |\n')
        f.write('| :---------------- | :-------------------- | :---------------- |\n')
        
        for i in range(numOfLines):
            f.write('| '+Ch[i] + '\t | \t' + Pin[i] + '\t | \t' + Eng[i] + ' | \n')
        
        f.write('\n\n')
    
    f.write(NAV_TEXT)
    
    f.close()
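
The table-writing part of `lesson_summary` can be tried on its own. The sketch below is a hypothetical helper, `csv_to_markdown_table`, that turns rows in the same five-column layout into a markdown table. It reads from any file-like object, so an in-memory sample is enough to test it; the sample row is made up:

```python
import csv
import io

def csv_to_markdown_table(csv_file, header):
    """Build a markdown table from rows whose first three columns are
    Chinese, Pinyin and English."""
    lines = ['**' + header + '**', '',
             '| Chinese | Pinyin | English |',
             '| :--- | :--- | :--- |']
    for row in csv.reader(csv_file, delimiter=','):
        chinese, pinyin, english = row[:3]  # remaining columns are ignored
        lines.append('| ' + chinese + ' | ' + pinyin + ' | ' + english + ' |')
    return '\n'.join(lines) + '\n'

# Made-up row in the five-column layout that lesson_summary unpacks
sample = io.StringIO('没问题,méiwèntí,No problem,audio.mp3,dialogue\n')
print(csv_to_markdown_table(sample, 'Dialogue'))
```

Returning the table as a string, rather than writing to the open file, keeps the formatting separate from the file handling.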

In [47]:
#lesson_summary(lesson_dic,local_CP_dir,['dialogue.csv','vocabulary.csv','expansion.csv'],NAV_TEXT)

In [48]:
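def get_lesson_materials(lesson_url,driver,folder):
    """
        Downloads the materials for the ChinesePod lesson at lesson_url into a
        new sub-folder of folder: the main MP3 files, the materials from the
        navigation bar and a markdown summary of the lesson.
        
    """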
    
    # Get lesson information
    lesson_dic = get_lesson_info(lesson_url,driver,verbose=True)
    
    # Extract information from dictionary
    lesson_level = lesson_dic['Level']
    key = lesson_dic['Key']
    lesson_name = lesson_dic['Name']
    video = lesson_dic['Video']
    
    # Video or only audio
    if video:
        lesson_type = 'Video'
    else:
        lesson_type = 'Audio_Only'
        
    # Folder name    
    lesson_folder = os.path.join(folder,lesson_level+'-'+lesson_type+'-'+key+'-'+lesson_name)
    
    # If folder doesn't exist, create it
    if not os.path.exists(lesson_folder):
        os.makedirs(lesson_folder) 
    
    # Get main MP3 files
    [NAMES,URLS,FNAMES] = get_main_mp3_files(lesson_url,driver,lesson_folder)
    
    # Get materials from the navigation bar, dialogue, vocabulary etc ...
    [NAV_NAMES,NAV_URLS,NAV_TEXT] = get_nav_materials(lesson_url,driver,lesson_folder)

    # Only want dialogue, vocabulary, expansion and grammar 
    CSV_FILES_NAMES = []
    CSV_LIST = ['dialogue','vocabulary','expansion','grammar']
    
    for name in NAV_NAMES:
        
        name = name.lower()

        if name in CSV_LIST:
            # Only list the CSV file if it was actually created, otherwise
            # lesson_summary would raise an error when trying to open it
            csv_file_name = name+'.csv'
            if os.path.exists(os.path.join(lesson_folder,csv_file_name)):
                CSV_FILES_NAMES.append(csv_file_name)
            
    # Create lesson summary
    lesson_summary(lesson_dic,lesson_folder,CSV_FILES_NAMES,NAV_TEXT)
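
The folder name in `get_lesson_materials` is built from the same pieces as the summary file name in `lesson_summary`, and the existence check before `os.makedirs` can be replaced by `exist_ok=True`. A sketch of both ideas follows. `lesson_folder_name` is a hypothetical helper, and the dictionary values are taken from the sample output for the lesson "Classroom Chinese":

```python
import os
import tempfile

def lesson_folder_name(lesson_dic):
    """Hypothetical helper: Level-Type-Key-Name, as used for the lesson folder."""
    lesson_type = 'Video' if lesson_dic['Video'] else 'Audio_Only'
    return '-'.join([lesson_dic['Level'], lesson_type,
                     lesson_dic['Key'], lesson_dic['Name']])

lesson_dic = {'Level': 'Elementary', 'Video': True,
              'Key': '3081', 'Name': 'Classroom Chinese'}

# A temporary directory stands in for local_CP_dir
with tempfile.TemporaryDirectory() as base:
    lesson_folder = os.path.join(base, lesson_folder_name(lesson_dic))
    os.makedirs(lesson_folder, exist_ok=True)  # no error if it already exists
    os.makedirs(lesson_folder, exist_ok=True)  # safe to call again
    print(os.path.basename(lesson_folder))
```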
    


In [49]:
def get_my_selection(selection,driver,directory):
    
    """
    Downloads the selected lessons.
    """
    
    # Create a log file for errors, we are expecting possible TimeoutExceptions
    error_log = open(os.path.join(directory,'error_log.txt'), mode = 'w')
    
    # We will retry a maximum of 5 times
    total_number_of_tries = 5
    
    error_count = 0
        
    for i,url in enumerate(selection):
        
        print('Now getting url (',i,') ', url)
        
        for j in range(0,total_number_of_tries):
        
            try:
        
                get_lesson_materials(url,driver,directory)
                
                # if all ok, then break out of the for loop
                break
            
            except TimeoutException as e:
                
                error_count = j + 1
                
                if error_count < total_number_of_tries:
                    
                    # We will try again, but first let's give their server a short break
                    
                    # Load something else
                    #driver.get('https://www.google.com/')
                    
                    # Wait 30 seconds
                    time.sleep(30)
                    
                    error_log.write('TimeoutException Thrown. Now retrying, you may have multiple copies of audio files\n')
                    error_log.write('\n')
                    error_log.write('Error on lesson at: ')
                    error_log.write('\n')
                    error_log.write(url)
                    error_log.write('\n')
                    error_log.write('error_count: ' + str(error_count) + '\n')
                    
                else:
                    
                    error_log.write('TimeoutException Thrown and caught multiple times\n')
                    error_log.write('\n')
                    error_log.write('Error on lesson at: ')
                    error_log.write('\n')
                    error_log.write(url)
                    error_log.write('\n')
                    error_log.write('error_count: ' + str(error_count))
                    error_log.write('\n')
                    error_log.write('Maximum number of retries reached. You should try to download this lesson again!')
                    error_log.write('\n')
                    error_log.write('******************************************************')
                    error_log.write('\n')

    error_log.close()    
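
The retry logic in `get_my_selection` can be separated from the logging and tested without a browser. The sketch below is a hypothetical `retry` helper, not the notebook's implementation. The built-in `TimeoutError` stands in for Selenium's `TimeoutException`, and the `sleep` and `log` parameters are assumptions added so the example runs instantly:

```python
import time

def retry(action, max_tries=5, wait_seconds=30, exceptions=(Exception,),
          sleep=time.sleep, log=print):
    """Call action() until it succeeds or max_tries is reached.

    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        try:
            action()
            return True
        except exceptions as e:
            log('Attempt ' + str(attempt) + ' failed: ' + repr(e))
            if attempt < max_tries:
                sleep(wait_seconds)  # give the server a short break
    return False

# Illustration with a stand-in action that fails twice, then succeeds
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError('simulated timeout')

print(retry(flaky, exceptions=(TimeoutError,), sleep=lambda s: None))
```

In the notebook, the equivalent call would pass a function that runs `get_lesson_materials` for one url and `exceptions=(TimeoutException,)`.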
    

## Get the lesson materials for one lesson

In [50]:
lesson_url = 'https://chinesepod.com/lessons/classroom-chinese'
get_lesson_materials(lesson_url,driver,local_CP_dir)

Name:  Classroom Chinese 

Lesson url:  https://chinesepod.com/lessons/classroom-chinese 

Level:  Elementary 

Teachers:  Fiona Tian | Gwilym 

Key:  3081 

Video:  True 

Date published:  May 19, 2017 

Tags:  video | classroom | grammar 

Functions:   

Lesson image url: 
 https://s3contents.chinesepod.com/3081/3be10bce2981e6b7a7c3d7b34ea37ce82f128a8e/images/9fc501635eaa80485bca43cf361be821a900e535.jpg 

Description: 
 14:09     


When you're learning Mandarin, it's important to know some important nouns for the language you're learning, for example, how to say; grammar, new vocabulary, and everyone's favourite...the exam!


Time Word Tips 

...... Getting Main MP3 Files ......

...... Skipping Comments Tab ......

...... Getting Dialogue Tab ......

['生词有问题吗？', '没问题！', '句子有问题吗？', '没问题！', '语法有问题吗？', '没问题！', '那后天的考试应该也没问题吧？', '欸？不是，有好大的问题！', '刚才不是都说没问题吗？']
['shēngcí yǒu wèntí ma？', 'méiwèntí！', 'jùzi yǒu wèntí ma？', 'méiwèntí！', 'yǔfǎ yǒu wèntí ma？', 'méiwèntí！', 'nà hòutiān de kǎos

## Get many lessons at once

We will load previously scraped data and select lessons from it.

### Load the data

In [51]:
store = pd.HDFStore(os.path.join(local_CP_dir,'ChinesePod-Library-Cleaned.h5'), 'r') # pip install tables
cp_lesson_df = store['data']
print(len(cp_lesson_df))

4305


We will select the 5 newest Newbie lessons to download. 

In [52]:
g = cp_lesson_df.groupby('Level',sort=False) # sort = False keeps it the same order
newbie_df = g.get_group('Newbie')[:5]
print(len(newbie_df))
newbie_df[['Date','Name','Level','URL']]

5


| Key | Date | Name | Level | URL |
| :--- | :--- | :--- | :--- | :--- |
| 4266 | 2019-12-27 | Is learning Chinese difficult? | Newbie | https://chinesepod.com/lesson-redirect?v3id=4266 |
| 4244 | 2019-12-04 | Introduction to HSK Test Prep | Newbie | https://chinesepod.com/lesson-redirect?v3id=4244 |
| 4254 | 2019-12-04 | Lesson 10 - Don’t Pee On Grandpa Tree | Newbie | https://chinesepod.com/lesson-redirect?v3id=4254 |
| 4256 | 2019-12-04 | Lesson 12 - Literally I Love You | Newbie | https://chinesepod.com/lesson-redirect?v3id=4256 |
| 4255 | 2019-12-04 | Lesson 11 - Try Me Doc. | Newbie | https://chinesepod.com/lesson-redirect?v3id=4255 |


Alternatively, we could just take the 5 newest lessons, regardless of level:

In [53]:
newest_df = cp_lesson_df[:5]
print(len(newest_df))
newest_df[['Date','Name','Level','URL']]

5


| Key | Date | Name | Level | URL |
| :--- | :--- | :--- | :--- | :--- |
| QW0465 | 2020-01-10 | Five ways of seeing: 看、见、看见、看到、看出 | Pre-Intermediate | https://chinesepod.com/lesson-redirect?v3id=QW... |
| 4263 | 2020-01-08 | Animal Proverbs: Become a pig to catch a tiger | Intermediate | https://chinesepod.com/lesson-redirect?v3id=4263 |
| 4423 | 2020-01-06 | 相见恨晚——老电影 | Upper Intermediate | https://chinesepod.com/lesson-redirect?v3id=4423 |
| 4272 | 2020-01-03 | Character mind map: 客 guest | Elementary | https://chinesepod.com/lesson-redirect?v3id=4272 |
| 4271 | 2019-12-31 | How To Use Action Complete “le” 了 | Pre Intermediate | https://chinesepod.com/lesson-redirect?v3id=4271 |


In [54]:
# Only get a few lessons at a time so as not to overburden their server
selection = list(newest_df.URL)

In [55]:
get_my_selection(selection,driver,local_CP_dir)

Now getting url ( 0 )  https://chinesepod.com/lesson-redirect?v3id=QW0465
Name:  Five ways of seeing: 看、见、看见、看到、看出 

Lesson url:  https://chinesepod.com/lesson-redirect?v3id=QW0465 

Level:  Qing Wen 

Teachers:  Joy | Joe 

Key:  QW0465 

Video:  True 

Date published:  January 10, 2020 

Tags:  video | synonyms 

Functions:   

Lesson image url: 
 https://s3contents.chinesepod.com/extra/QW0465/108cafa4617764ef72e53e9272ef46a9854e06cc/images/9de4100c6bba522c134aaadb93fec44e0ee7c362.jpg 

Description: 
 11:19     
In this Qingwen, we answer a question on our forum from user lukeaz: There are so many ways to say "see", including 看、见、看见、看到、看出, how do we distinguish them? 
Let's check out how to use all the different ways to say "see"! 
Forum post: https://forum.chinesepod.com/t/all-the-different-ways-to-say-see/3977 

...... Getting Main MP3 Files ......

...... Skipping Comments Tab ......

...... Getting Dialogue Tab ......

['看', '看电影', '看书', '看花', '看风景', '我要看你的狗的照片。', '见', '我想见Joe。',

Name:  Animal Proverbs: Become a pig to catch a tiger 

Lesson url:  https://chinesepod.com/lesson-redirect?v3id=4263 

Level:  Intermediate 

Teachers:  Joy | Joshua 

Key:  4263 

Video:  True 

Date published:  January 8, 2020 

Tags:  video | animal proverbs | animals 

Functions:   

Lesson image url: 
 https://s3contents.chinesepod.com/4263/400906d886c213694a6fd2a9266e5925e1dabaf6/images/06a1df048c74f9f05a0b4680ef7c891f0bb5a2cf.jpg 

Description: 
 10:07     

Animal metaphors are prevalent in almost every language, but people from different cultures may regard the same animal differently. In Chinese proverbs, you will see a lot of animal metaphors to symbolize human actions.

 In this lesson, we're showing you three common proverbs related to pigs. They can be directly translated to "Men are afraid of being famous, just as pigs are afraid of getting fat", "Disguise oneself as a pig to catch a tiger", and "Even though one has never eaten pork, at least one has seen pigs run". Do 

Expected English word is also valid pinyin
Word:  cunning
['懒惰', '肮脏', '人怕出名猪怕肥', '出名', '招来', '张扬', '扮猪吃老虎', '靠近', '装傻', '没吃过猪肉，也见过猪跑', '常识', '骂', '面试', '出社会', '电邮', '小猪扑满', '财富', '人肉搜索', '假扮', '心机重']
['lǎnduò', 'āngzāng', 'rénpàchūmíngzhūpàféi', 'chūmíng', 'zhāolái', 'zhāngyáng', 'bànzhūchīlǎohǔ', 'kàojìn', 'zhuāngshǎ', 'méichīguòzhūròu,yějiànguòzhūpǎo', 'chángshí', 'mà', 'miànshì', 'chūshèhuì', 'diànyóu', 'xiǎozhūpūmǎn', 'cáifù', 'rénròusōusuǒ', 'jiǎbàn', 'xīnjīzhòng']
['idle', 'dirty', 'Being too famous would ultimately lead to your downfall.', 'to become well known', 'to attract', 'to display ostentatiously', 'To pretend to be weak, dumb or ignorant to take advantage of others.', 'to approach', 'to act stupid', 'Someone who has no knowledge should have common sense.', 'common sense', 'to scold', 'interview', 'to enter the workforce', 'email', 'piggy bank', 'wealth', 'to dox', 'to disguise', 'cunning']
[]
[]
...... Getting Expansion Tab ......

['人家都说‘人怕出名猪怕肥’，你怎么还这么张扬？', '-', '诶，人怕

['写新闻稿能够挖掘我的潜力，激发我的创造性，真是相见恨晚。', '在读书的时候，总会读到一些对我启发很大的话语，让我感到相见恨晚。', '正是生活中的一次次阴差阳错，给了我们相遇、相识的机会。', '他一心热爱文学，却阴差阳错地成为了一名建筑师，不过这并没有改变他每周读两本书的习惯。', '上了年纪的人谈恋爱，就像老房子着了火，一发不可收拾。', '我最初听摇滚音乐，只是想随便试试，结果一见钟情，从此一发不可收拾。', '今天天气这么好，不如我们一起去散步吧！', '虽然我们是同一年上岗，但是我的业绩却远远不如她。']
['xiě xīnwéngǎo nénggòu wājué wǒ de qiánlì，jīfā wǒ de chuàngzàoxìng，zhēnshi xiāngjiànhènwǎn。', 'zài dúshū de shíhou，zǒnghuì dú dào yīxiē duì wǒ qǐfā hěn dà de huàyǔ，ràng wǒ gǎndào xiāngjiànhènwǎn。', 'zhèngshì shēnghuó zhōng de yī cì cì yīnchāyángcuò，gěi le wǒmen xiāngyù、xiāngshí de jīhuì。', 'tā yīxīn rèài wénxué，què yīnchāyángcuò dì chéngwéi le yī míng jiànzhùshī，bùguò zhè bìng méiyǒu gǎibiàn tā měizhōu dú liǎng běn shū de xíguàn。', 'shàngleniánjì de rén tánliànài，jiù xiàng lǎo fángzi zháolehuǒ，yīfābùkěshōushi。', 'wǒ zuìchū tīng yáogǔn yīnyuè，zhǐshì xiǎng suíbiàn shì shì，jiéguǒ yījiànzhōngqíng，cóngcǐ yīfābùkěshōushi。', 'jīntiān tiānqì zhème hǎo，bùrú wǒmen yīqǐ qù sànbù ba！', 'suīrán wǒmen shì tóngyīnián shànggǎng，dànshì wǒ de 

Expected English word is also valid pinyin
Word:  polite
['客气', '请客', '客厅', '客人', '客户', '顾客', '房客', '乘客', '观光客', '播客', '背包客', '黑客', '三剑客']
['kèqi', 'qǐngkè', 'kètīng', 'kèrén', 'kèhù', 'gùkè', 'fángkè', 'chéngkè', 'guānguāngkè', 'bōkè', 'bēibāokè', 'hēikè', 'sānjiànkè']
['polite', 'to treat sb. to a meal or a drink', 'living room', 'guest, customer', 'client', 'customer', 'tenant', 'passenger', 'tourist', 'podcast', 'backpacker', 'hacker (computing) (loanword)', 'the three musketeers']
['https://s3contents.chinesepod.com/4272/733551a9c9f427728c3ad426c96d603db0764f77/mp3/glossary/source/chinesepod_8220_vocabulary_120209_prototype_1577765360.mp3', 'https://s3contents.chinesepod.com/4272/733551a9c9f427728c3ad426c96d603db0764f77/mp3/glossary/source/chinesepod_8220_vocabulary_120210_prototype_1577765368.mp3', 'https://s3contents.chinesepod.com/4272/733551a9c9f427728c3ad426c96d603db0764f77/mp3/glossary/source/chinesepod_8220_vocabulary_120211_prototype_1577765376.mp3', 'https://s3contents.ch

['你都已经喝了这么多了，还喝啊！不过是丢了工作而已，有必要这么难过吗？', '丢了手机后，他突然没事可做了，无聊得很呢！', '记得！吃了饭才能吃药，晚上吃了药才能睡，可别弄反了！', '昨天不小心撞了一辆车，我赔了三千多块。', '昨天因为停电，所以在走廊上跌了一交。', '你们才见了几次面，就决定要结婚了！？这不太好吧...']
['nǐ dōu yǐjīng hē le zhème duō le，hái hē ā！bùguò shì diū le gōngzuò éryǐ，yǒu bìyào zhème nánguò ma？', 'diū le shǒujī hòu，tā tūrán méi shì kě zuò le，wúliáo dehěn ne！', 'jìde！chī le fàn cái néng chīyào，wǎnshang chī le yào cái néng shuì，kě bié nòng fǎn le！', 'zuótiān bù xiǎoxīn zhuàng le yī liàng chē，wǒ péi le sān qiān duō kuài。', 'zuótiān yīnwèi tíngdiàn，suǒyǐ zài zǒuláng shàng diē le yī jiāo。', 'nǐmen cái jiàn le jǐcì miàn，jiù juédìng yào jiéhūn le！？zhè bùtàihǎo ba...']
["You have already drunk a lot, and you're still drinking? It's just a job, you don't have to get so sad!", 'After losing his phone, he has nothing to do now. He’s really bored.', 'Remember! Take the medicine after a meal and sleep after taking the medicine. Don’t get it wrong.', "I crashed someone's car by accident yesterday and it cost me 3000 dollars 

## When you're finished

In [56]:
#driver.close() # closes the current window

In [57]:
# Closes the browser and shuts down the ChromeDriver executable
# that was launched when the driver was created
driver.quit() 