### Part 1: Data Mining

#### The objectives of this project are:
- To mine data on top LibriVox audiobooks released from 2018 to 2022, using Beautiful Soup and the Whisper speech recognition model.
- To perform Named Entity Recognition (NER) on the extracted text using spaCy.
- To train a model that predicts views for English-language audiobooks.


#### This project is divided into three parts:
1. Data Mining (ipynb file: 1.MinePlayers_Mining)
   - Input: LibriVox archive URL
   - Output: librivox_df (csv file attached)

2. Named Entity Recognition (ipynb file: 2.MinePlayers_NER)
   - Input: librivox_df (csv file attached) from Part 1
   - Output: en_books_df (csv file attached)
   - Also attached: Excel file of names sorted by gender, 'Sorted_names.xlsx'

3. Data Preprocessing and Predictive Analysis (ipynb file: 3.MinePlayers_Pred)
   - Input: en_books_df (csv file attached) from Part 2
   - Final result: trained Linear Regression prediction model



### Part 1: Data Mining 

In [176]:
import requests
from bs4 import BeautifulSoup
import whisper
import pandas as pd
import re
import datetime

Step 1. Extract book links from the LibriVox archive page using requests and BeautifulSoup.

In [144]:
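def getAllLinks(url):
    """Return absolute archive.org URLs for every book listed on an archive page."""
    # archives
    archives = requests.get(url)
    archive_soup = BeautifulSoup(archives.content, 'html.parser')
    # each listed book has its title link inside a <div class="item-ttl">
    item = archive_soup.find_all('div', {'class':'item-ttl'})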
    
    # get links for book_soup
    all_links = []
    for i in item:
        a_link = 'https://archive.org{}'.format((i.find('a')['href']))
        all_links.append(a_link)
    return all_links
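
As a minimal offline sketch of the selection logic in `getAllLinks`, the snippet below runs the same `item-ttl` lookup on a hand-written HTML fragment instead of a live page. The sample markup, the `extract_item_links` name, and the use of `urljoin` in place of string formatting are assumptions for illustration; they are not part of the original notebook.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written stand-in for an archive listing page (hypothetical markup).
SAMPLE_HTML = """
<div class="item-ttl"><a href="/details/book_one_librivox">Book One</a></div>
<div class="item-ttl"><a href="/details/book_two_librivox">Book Two</a></div>
<div class="item-ttl"><span>No link here</span></div>
"""


def extract_item_links(html, base_url="https://archive.org"):
    """Collect absolute URLs from every <div class="item-ttl"> that holds a link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("div", class_="item-ttl"):
        anchor = item.find("a", href=True)
        if anchor is not None:  # skip entries without an <a href=...>
            links.append(urljoin(base_url, anchor["href"]))
    return links


links = extract_item_links(SAMPLE_HTML)
print(links)
```

Skipping entries that lack an `<a href>` avoids the `TypeError` that `i.find('a')['href']` raises when `find` returns `None`.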

Step 2. Parse each link from Step 1 to extract the 'Title', 'Author', 'Views', 'Favorites', and 'Date_uploaded' columns, along with the reviews.

In [91]:
def booksDF(L):
    
    # book_soup
    books_list = []
    all_reviews = []
    for link in L:
        book = requests.get(link)
        book_soup = BeautifulSoup(book.content, 'html.parser')
        # stats counters: the first is views, the second (when present) is favorites
        count = book_soup.find_all('span', {'class':'item-stats-summary__count'})
        title = book_soup.find('span', {'class':'breaker-breaker'}).text
        author = book_soup.find('a', {'rel':'nofollow'}).text
        date = book_soup.find('time', {'itemprop': 'uploadDate'}).text
        views = int(count[0].text.replace(',',''))
        if len(count) > 1:
            fav = count[1].text
        else:
            fav = ''
    
        books_list.append([title,author,views,fav,date])
        
        reviews = book_soup.find_all('div', {'class': 'aReview'})
        review_list = []
        for i in reviews:
            rev = i.find('div', {'class': 'breaker-breaker'}).text
            review_list.append(rev)
        all_reviews.append(review_list)
    
    # build books Df
    books_df = pd.DataFrame(books_list,
                            columns = ['Title', 'Author', 'Views', 'Favorites','Date_uploaded'])
    books_df['Reviews'] = all_reviews
    num_rev = books_df['Reviews'].apply(len)
    books_df['Reviews_n'] = num_rev
    
    return books_df
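
The scraped fields arrive as display strings: counts may contain thousands separators, and dates look like `October 16, 2018`. A minimal sketch of normalizing them with the standard library follows; the helper names `parse_count` and `parse_upload_date` are hypothetical, and the date format is assumed from the values shown in the final table.

```python
from datetime import date, datetime


def parse_count(text):
    """Convert a display count such as '1,234' to int; return None if blank."""
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned else None


def parse_upload_date(text):
    """Parse a date such as 'October 16, 2018' (assumed format) into a date."""
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


print(parse_count("1,234"))
print(parse_count(""))
print(parse_upload_date("October 16, 2018"))
```

Applying `parse_count` to the favorites value as well would keep `Favorites` numeric, with `None` where the page shows no second counter.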

Note: The archive listing was filtered by year, and each year's URL was passed to the functions above to build one books dataframe per year (2018 to 2022).

In [197]:
url_18 = 'https://archive.org/details/librivoxaudio?and%5B%5D=subject%3A%22audiobooks%22&and%5B%5D=mediatype%3A%22audio%22&and%5B%5D=year%3A%222018%22&sort=-week'
links18 = getAllLinks(url_18)

In [173]:
df_2018 = booksDF(links18)

In [203]:
url_19 = 'https://archive.org/details/librivoxaudio?sort=-week&and[]=subject%3A%22audiobooks%22&and[]=mediatype%3A%22audio%22&and[]=year%3A%222019%22'
links19 = getAllLinks(url_19)

In [174]:
df_2019 = booksDF(links19)

In [215]:
url_20 = 'https://archive.org/details/librivoxaudio?sort=-week&and[]=subject%3A%22audiobooks%22&and[]=mediatype%3A%22audio%22&and[]=year%3A%222020%22'
links20 = getAllLinks(url_20)

In [171]:
df_2020 = booksDF(links20)

In [148]:
url_21 = 'https://archive.org/details/librivoxaudio?sort=-week&and[]=subject%3A%22audiobooks%22&and[]=mediatype%3A%22audio%22&and[]=year%3A%222021%22'
links21 = getAllLinks(url_21)

In [172]:
df_2021 = booksDF(links21)

In [76]:
url_22 = 'https://archive.org/details/librivoxaudio?sort=-week&and[]=subject%3A%22audiobooks%22&and[]=mediatype%3A%22audio%22&and[]=year%3A%222022%22'
links22 = getAllLinks(url_22)

In [92]:
df_2022 = booksDF(links22)
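
The five archive URLs above differ only in the `year` filter. As a sketch, they could be generated from one helper; `build_archive_url` is a hypothetical name, and the query parameters are copied from the URLs above rather than from any documented archive.org API.

```python
from urllib.parse import urlencode

BASE_URL = "https://archive.org/details/librivoxaudio"


def build_archive_url(year):
    """Build the year-filtered listing URL with the same filters and sort key."""
    params = [
        ("sort", "-week"),
        ("and[]", 'subject:"audiobooks"'),
        ("and[]", 'mediatype:"audio"'),
        ("and[]", f'year:"{year}"'),
    ]
    # urlencode percent-encodes the brackets, colons, and quotes
    return f"{BASE_URL}?{urlencode(params)}"


urls = {year: build_archive_url(year) for year in range(2018, 2023)}
print(urls[2018])
```

Each value in `urls` can then be passed to `getAllLinks`, for example `getAllLinks(urls[2018])`.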

Step 3. Extract the audio link for chapter 1 of each book and use Whisper's tiny model to convert speech to text.

In [111]:
model = whisper.load_model("tiny")

In [56]:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [388]:
def getAudio(L):
    
    # book_soup
    catalog_links = []
    for link in L:
        book = requests.get(link)
        book_soup = BeautifulSoup(book.content, 'html.parser')
        descript = book_soup.find('div', {'id':'descript'})
        # find the anchor (with an href) whose text mentions the LibriVox catalog page
        book_link = descript.find('a', href=True, string=re.compile('LibriVox catalog', re.IGNORECASE))
        # LibriVox catalog page
        if book_link is None or book_link.get('href') == '':
            catalog_links.append('')
        else:
            catalog_links.append(book_link.get('href'))
        
    
    audio_links = []
    for c in catalog_links:
        if c == '':
            audio_links.append('')
        else:
            book_req = requests.get(c, verify=False)
            catalog_soup = BeautifulSoup(book_req.content,'html.parser')
            audio_link = catalog_soup.find('td').find('a')['href']

            audio_links.append(audio_link)
    return audio_links
    

In [79]:
# audio 22
Ch1_links = getAudio(links22)

In [152]:
audio21 = getAudio(links21)

In [153]:
audio20 = getAudio(links20)

In [389]:
audio19 = getAudio(links19)

In [249]:
audio18 = getAudio(links18)

Note: The chunk below takes up to 2.5 hours per table to run.

In [136]:
# 2022 was used as a practice chunk to determine how much time whisper takes for one table
# and that is why it is not inside a function.
transcripts = []
for l in Ch1_links:
    ch = model.transcribe(l, fp16 = False)
    transcripts.append(ch['text'][170:])

In [396]:
def speech2Text(L):
    
    transcripts = []
    for l in L:
        if l == '':
            transcripts.append('')
        else:
            ch = model.transcribe(l, fp16 = False)
            transcripts.append(ch['text'][170:])
        
    return transcripts
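
The `[170:]` slice removes a fixed number of characters, and the sample transcripts in the final table still begin with fragments of the LibriVox opening announcement. One alternative, sketched here as an assumption rather than the notebook's method, is to cut at the first mention of the LibriVox web address, tolerating the misspellings Whisper produces (for example `Librevox.org` or `libribox.org`). The pattern and the `strip_preamble` name are illustrative.

```python
import re

# Matches "librivox.org" and near misses such as "Librevox.org" or "libribox.org",
# plus any punctuation or whitespace that follows.
PREAMBLE_END = re.compile(r"libr\w{0,3}[vb]ox\.org\W*", re.IGNORECASE)


def strip_preamble(text, fallback_chars=170):
    """Drop everything up to the first LibriVox address; else drop a fixed prefix."""
    match = PREAMBLE_END.search(text)
    if match:
        return text[match.end():]
    return text[fallback_chars:]


print(strip_preamble("To volunteer, please visit Librevox.org. Section 1. Editor"))
```

When no address is found, the function falls back to the fixed 170-character cut used in the notebook.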

Note: Each of the chunks below (transcripts_21 through transcripts_18) takes up to 2.5 hours to run.

In [161]:
transcripts_21 = speech2Text(audio21)

In [162]:
transcripts_20 = speech2Text(audio20)

In [400]:
transcripts_19 = speech2Text(audio19)

In [254]:
transcripts_18 = speech2Text(audio18)
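
Because each transcription run takes hours, an interruption loses everything computed so far. The sketch below saves each transcript to a JSON file as soon as it is produced, so a rerun skips links that are already done. `transcribe_with_cache`, the cache file layout, and the `transcribe` callable are hypothetical; the demo uses a stand-in function so that it runs without Whisper.

```python
import json
import tempfile
from pathlib import Path


def transcribe_with_cache(links, transcribe, cache_path):
    """Transcribe each link once, persisting results to a JSON file after every item."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    else:
        cache = {}
    results = []
    for link in links:
        if link == "":
            results.append("")  # keep row alignment for books without audio
            continue
        if link not in cache:
            cache[link] = transcribe(link)
            cache_file.write_text(json.dumps(cache, ensure_ascii=False), encoding="utf-8")
        results.append(cache[link])
    return results


# Demo with a stand-in transcriber (hypothetical) that records its calls.
calls = []


def fake_transcribe(link):
    calls.append(link)
    return f"text for {link}"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "transcripts.json"
    first = transcribe_with_cache(["a.mp3", "", "b.mp3"], fake_transcribe, cache_file)
    second = transcribe_with_cache(["a.mp3", "", "b.mp3"], fake_transcribe, cache_file)

print(first)
print(calls)
```

In the notebook, the callable might be `lambda link: model.transcribe(link, fp16=False)['text'][170:]`.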

In [139]:
df_2022['Transcripts'] = transcripts

In [184]:
df_2021['Transcripts'] = transcripts_21

In [189]:
df_2020['Transcripts'] = transcripts_20

In [402]:
df_2019['Transcripts'] = transcripts_19

In [266]:
df_2018['Transcripts'] = transcripts_18

In [441]:
#df_2022.to_csv('../df_2022', index = False, header=True)

In [442]:
#df_2021.to_csv('../df_2021', index = False, header=True)

In [443]:
#df_2020.to_csv('../df_2020', index = False, header=True)

In [444]:
#df_2018.to_csv('../df_2018', index = False, header=True)

In [467]:
#df_2019.to_csv('../df_2019', index = False, header=True)

Step 4. Add columns from the LibriVox catalog page: 'Genre', 'Language', 'Runtime', and 'Read_by'.

In [409]:
def getCatalogLinks(L):
    catalog_links = []
    for link in L:
        book = requests.get(link)
        book_soup = BeautifulSoup(book.content, 'html.parser')
        descript = book_soup.find('div', {'id':'descript'})
        # find the anchor (with an href) whose text mentions the LibriVox catalog page
        book_link = descript.find('a', href=True, string=re.compile('LibriVox catalog', re.IGNORECASE))
        # LibriVox catalog page
        if book_link is None or book_link.get('href') == '':
            catalog_links.append('')
        else:
            catalog_links.append(book_link.get('href'))
    
    return catalog_links
    

In [424]:
def addnlColumns(L):
    addnl_columns = []
    for c in L:
        if c == '':
            # keep four placeholders so every row has one value per added column
            addnl_columns.append(['', '', '', ''])
        else:
            book_req = requests.get(c, verify=False)
            catalog_soup = BeautifulSoup(book_req.content,'html.parser')
            p = catalog_soup.find_all('p', {'class':'book-page-genre'})  
            genre = p[0].text.split("Genre(s): ")[1]
            language = p[1].text.split("Language: ")[1]
            det = catalog_soup.find('dl', {'class': 'product-details clearfix'})
            details = det.find_all('dd')
            runtime = details[0].text
            read_by = details[3].text
            addnl_columns.append([genre,language,runtime,read_by])
    return addnl_columns
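
`split(label)[1]` raises an `IndexError` when a catalog page lacks the expected label. A small sketch of a more forgiving lookup uses `str.partition`; the helper name `value_after_label` is hypothetical.

```python
def value_after_label(text, label):
    """Return the text following `label`, or '' if the label is absent."""
    _, found, rest = text.partition(label)
    return rest.strip() if found else ""


print(value_after_label("Genre(s): Poetry, Short Stories", "Genre(s):"))
print(value_after_label("Language: Spanish", "Language:"))
print(repr(value_after_label("No label here", "Genre(s):")))
```

Returning an empty string keeps the row aligned with the other columns instead of stopping the whole scrape on one unusual page.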

In [425]:
addnl_2022 = addnlColumns(getCatalogLinks(links22))

In [427]:
df_2022[['Genre','Language','Runtime','Read_by']] = addnl_2022

In [429]:
addnl_2021 = addnlColumns(getCatalogLinks(links21))

In [433]:
df_2021[['Genre','Language','Runtime','Read_by']] = addnl_2021

In [430]:
addnl_2020 = addnlColumns(getCatalogLinks(links20))

In [434]:
df_2020[['Genre','Language','Runtime','Read_by']] = addnl_2020

In [431]:
addnl_2019 = addnlColumns(getCatalogLinks(links19))

In [461]:
df_addnl_2019 = pd.DataFrame(addnl_2019)

In [462]:
df_addnl_2019.columns = ['Genre','Language','Runtime','Read_by']

In [465]:
df_2019 = pd.concat([df_2019,df_addnl_2019], axis = 1)

In [432]:
addnl_2018 = addnlColumns(getCatalogLinks(links18))

In [440]:
df_2018[['Genre','Language','Runtime','Read_by']] = addnl_2018

Step 5. Concatenate the tables for 2018 to 2022 into the final table.

In [472]:
frames = [df_2018, df_2019, df_2020, df_2021, df_2022]

In [473]:
all_books_df = pd.concat(frames)

In [478]:
all_books_df = all_books_df.rename(columns={'Read_by':'Narrated_by'})

In [None]:
all_books_df.to_csv('../librivox_df', index = False, header=True)
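
`pd.concat(frames)` keeps each yearly frame's own row index, so index values repeat across years. A sketch on toy data shows one way to keep that per-year position as a column while also recording the year. The toy frames, the `Year` column, and the `ID` name are assumptions for illustration; the notebook does not show how the `ID` column in the final table was produced.

```python
import pandas as pd

# Toy stand-ins for two yearly tables (hypothetical data).
df_a = pd.DataFrame({"Title": ["A1", "A2"], "Views": [10, 20]})
df_b = pd.DataFrame({"Title": ["B1"], "Views": [30]})

# keys= labels the rows of each input frame; names= labels the two index levels.
combined = pd.concat([df_a, df_b], keys=[2018, 2019], names=["Year", "ID"])

# Turn both index levels into ordinary columns and get a fresh 0..n-1 index.
combined = combined.reset_index()
print(combined)
```

If the per-year position is not needed, `pd.concat(frames, ignore_index=True)` yields a single running index directly.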

#### Final Table
(Attached with the files as 'librivox_df')

In [484]:
pd.read_csv('../librivox_df')

| Unnamed: 0 | ID | Title | Author | Views | Favorites | Date_uploaded | Reviews | Reviews_n | Transcripts | Genre | Language | Runtime | Narrated_by |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Multilingual Short Works Collection 020 - Poet... | Various | 594092 | 2.0 | October 16, 2018 | [] | 0 | informoi al porvon tuli on volu visiti librevo... | Poetry, Short Stories | Multilingual | 02:52:08 | LibriVox Volunteers |
| 1 | 1 | The Book of Enoch | Unknown | 155927 | 102.0 | December 9, 2018 | ['\n I enjoyed this recording very much... | 1 | r, please visit Librevox.org. Section 1. Edito... | Religion | English | 04:28:56 | CJ Plogue |
| 2 | 2 | Aventuras de Arturo Gordon Pym | Edgar Allan Poe | 445836 | 3.0 | December 7, 2018 | [] | 0 | ón para ser voluntario por favor visite LibriB... | Gothic Fiction, Nautical & Marine Fiction | Spanish | 08:05:35 | Mongope |
| 3 | 3 | The Meditations of the Emperor Marcus Aurelius... | Marcus Aurelius | 302755 | 58.0 | January 2, 2018 | ['\n I recently discovered this gem of ... | 1 | ation or to volunteer, please visit Librevox.o... | Classics (Greek & Latin Antiquity), Biography ... | English | 04:47:46 | LibriVox Volunteers |
| 4 | 4 | Three Things | Ella Wheeler Wilcox | 516824 | 7.0 | July 1, 2018 | [] | 0 | gs there are, eternal in their worth. Love tha... | Multi-version (Weekly and Fortnightly poetry) | English | 00:12:59 | LibriVox Volunteers |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 370 | 70 | Dramatic Reading Scene and Story Collection, V... | Various | 56821 | 2.0 | April 30, 2022 | [] | 0 | ng, all Librevox recordings are in the public ... | Dramatic Readings | English | 03:17:50 | LibriVox Volunteers |
| 371 | 71 | Las guerras ibéricas | Appian of Alexandria | 64320 | 2.0 | May 14, 2022 | [] | 0 | oluntario, por favor piscite libribox.org. Las... | War & Military, Antiquity | Spanish | 02:41:49 | Epachuko |
| 372 | 72 | The Mormon Battalion, Its History and Achievem... | B. H. Roberts | 29381 | 1.0 | November 4, 2022 | ['\n If you are someone who is really f... | 1 | to volunteer, please visit Libravox.org. Read ... | War & Military, History | English | 02:51:09 | Wayne Cooke |
| 373 | 73 | Dogs and Puppies | Frances Trego Montgomery | 32539 | 2.0 | November 10, 2022 | [] | 0 | riVox.org. Read by Prajakta. Docs and Puppets ... | Action & Adventure | English | 00:51:56 | LibriVox Volunteers |
373,73,Dogs and Puppies,Frances Trego Montgomery,32539,2.0,"November 10, 2022",[],0,riVox.org. Read by Prajakta. Docs and Puppets ...,Action & Adventure,English,00:51:56,LibriVox Volunteers
