<a href="https://colab.research.google.com/github/cecilylynn/data-science-projects/blob/main/Love_Island_S23_Constestant_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Love Island Summer 2023 Contestant Data Scraper

As part of a larger Love Island data project, I need the basic info on each of the Islanders.

From [Love Island 2023 cast: Full line-up of season 10 contestants](https://www.radiotimes.com/tv/entertainment/reality-tv/love-island-summer-2023-cast-line-up/), we can scrape the first and last names, ages, and Instagram handles for each Islander, using Real Python's [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/) as a guide.



##The Scraper

###Installing and importing packages

In [40]:
#Installs

!pip install requests
!pip install beautifulsoup4



In [41]:
#Imports

import pandas as pd
import requests
from bs4 import BeautifulSoup

###Setting up the scraper

First, we need to grab the HTML from the page we want to scrape and parse it into a BeautifulSoup object.

In [42]:
URL = "https://www.radiotimes.com/tv/entertainment/reality-tv/love-island-summer-2023-cast-line-up/"
page = requests.get(URL)

#print(page.text)

soup = BeautifulSoup(page.content, "html.parser")


Next, we narrow the soup down to the one element that contains the content we want, so that later searches only look inside it.

In [43]:
page_text = soup.find(class_="editor-content mb-lg hidden-print js-piano-locked-content")


###Getting the names of the Islanders

In [44]:
# Get a bs4 ResultSet of all the names, using the fact that each name is wrapped in an "h2" tag
names = page_text.find_all("h2")

# Drop the extraneous headings by position. `del` removes by index,
# whereas list.remove() removes the first element that compares equal.
del names[0]
del names[24:26]

We can extract the names from the HTML, use str methods to clean up the strings, and split each name into a first and a last name. Both go into lists that we'll add to our pandas dataframe later.

In [45]:
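first_names = []
last_names = []

for name in names:
  # Keep only the text before the first ' - ' separator
  name_text = name.text.split(' - ', 1)[0].strip()
  # Split on the first space; partition() does not raise for one-word names
  first, _, last = name_text.partition(' ')
  first_names.append(first.strip())
  last_names.append(last.strip())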
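
The splitting rule can also be pulled into a small helper that is easy to check on its own. The sketch below is a minimal version under assumptions: `split_name` is a hypothetical name, the sample heading strings are made up for illustration rather than copied from the page, and a one-word name gets an empty last name instead of raising an error.

```python
def split_name(heading_text):
    """Split heading text into (first_name, last_name).

    Assumes the name comes before the first ' - ' separator.
    """
    # Keep only the part before the first ' - ' (a hyphen with spaces on both sides)
    name_text = heading_text.split(' - ', 1)[0].strip()
    # partition() splits on the first space and never raises for one-word names
    first, _, last = name_text.partition(' ')
    return first, last.strip()


# Illustrative inputs (hypothetical heading text)
print(split_name("Molly Marsh - extra heading text"))
print(split_name("Elom Ahlijah-Wilson"))
```

A hyphenated surname such as Ahlijah-Wilson stays intact, because only a hyphen surrounded by spaces is treated as a separator.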


###Getting Instagram URLs and handles

We use the "a" tag to "strain" the Instagram info from our soup: we create lists for the URLs and handles, add the data to the lists, and print each entry to check that it's done correctly.

In [46]:
links = page_text.find_all("a")

insta_urls = []
insta_handles = []

for link in links:
  # .get() avoids a KeyError on anchors that have no href attribute
  href = link.get('href', '')
  if 'instagram' in href:
    insta_urls.append(href)
    insta_handles.append(link.text)
    # Print the position of the entry just added, as a check
    print(len(insta_urls) - 1, href, link.text)

0 https://www.instagram.com/mollygracemarsh/?hl=en @mollygracemarsh
1 https://www.instagram.com/daniellemazhindu/ @daniellemazhindu
2 https://www.instagram.com/abimooresxox/?hl=en @abimooresxox
3 https://www.instagram.com/tinkreadingxo/?hl=en-gb @tinkreadingxo
4 https://www.instagram.com/kodiemurphy/?hl=en @kodiemurphy
5 https://www.instagram.com/ouzysee_/ @ouzysee_
6 https://www.instagram.com/z_ashford/?hl=en @z_ashford
7 https://www.instagram.com/_truegains/?hl=en @_truegains
8 https://www.instagram.com/benjaminn_noell/?hl=en @benjaminn_noell
9 https://www.instagram.com/kadymcdermott/?hl=en @kadymcdermott
10 https://www.instagram.com/montelmckenzie/ @montelmckenzie
11 https://www.instagram.com/scottvds17/ @scottvds17
12 https://www.instagram.com/leahjtaylorr/?hl=en @leahjtaylorr
13 https://www.instagram.com/sammyroot_/ @sammyroot_
14 https://www.instagram.com/whitbrownsx/ @whitbrownsx
15 https://www.instagram.com/zachariah_noble97/ @zachariah_noble97
16 https://www.instagram.com/tyri
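
Some of the printed URLs carry query strings such as `?hl=en`, and the handle is taken from the link text rather than from the URL. As a cross-check, a handle can also be derived from the URL itself. The sketch below uses only the standard library; `handle_from_url` is a hypothetical helper name, and it assumes the profile name is the first path segment of the URL.

```python
from urllib.parse import urlsplit


def handle_from_url(url):
    """Return '@<profile>' for an Instagram profile URL, or None if there is no path."""
    # urlsplit separates the path from the query string (e.g. '?hl=en')
    path = urlsplit(url).path
    # The profile name is assumed to be the first non-empty path segment
    segments = [part for part in path.split('/') if part]
    return '@' + segments[0] if segments else None


print(handle_from_url("https://www.instagram.com/mollygracemarsh/?hl=en"))
print(handle_from_url("https://www.instagram.com/ouzysee_/"))
```

Comparing `handle_from_url(url)` with the scraped link text for each entry would flag any link whose text and target disagree.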

Fortunately, these are in the same order as the islanders' names, which will make matching them up easier.

However, we immediately notice that there are fewer Instagram accounts than there are islanders. Checking the page by hand confirms that these links are missing from the website, so it's not a problem with the scraper.

To deal with this, I'll manually find the Instagram accounts of the missing islanders (or determine that they're unavailable) and insert them at the proper point in our lists.

**Missing:**

Amber Wise @amberwse \\
Gabby Jeffery @gabriellejefferyy \\
Lochan Nowacki @lochan_nowacki \\
George Fensom @georgefensom


In [47]:
# (first name, Instagram URL, handle) for each islander missing from the page
missing = [
  ('Amber', 'https://www.instagram.com/amberwse/', '@amberwse'),
  ('Gabby', 'https://www.instagram.com/gabriellejefferyy/', '@gabriellejefferyy'),
  ('Lochan', 'https://www.instagram.com/lochan_nowacki/', '@lochan_nowacki'),
  ('George', 'https://www.instagram.com/georgefensom/', '@georgefensom'),
]

# Process the islanders in the order listed, so earlier gaps are filled
# before later indices are used
for first_name, url, handle in missing:
  index = first_names.index(first_name)
  insta_urls.insert(index, url)
  insta_handles.insert(index, handle)
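
Position-based insertion depends on the order in which the missing islanders are handled. One alternative, sketched here under assumptions, is to walk the names once and take each handle either from a manual lookup or from the next scraped entry. `align_handles` and `manual_handles` are hypothetical names, and the approach assumes that first names are unique and that the scraped entries are in page order.

```python
def align_handles(first_names, scraped_handles, manual_handles):
    """Return one handle per name, filling gaps from a manual lookup."""
    scraped = iter(scraped_handles)
    aligned = []
    for name in first_names:
        if name in manual_handles:
            # This islander had no link on the page: use the manual value
            aligned.append(manual_handles[name])
        else:
            # Otherwise consume the next scraped handle (None if we run out)
            aligned.append(next(scraped, None))
    return aligned


names_sample = ['Molly', 'Danielle', 'Abi', 'Amber', 'Tink']
scraped_sample = ['@mollygracemarsh', '@daniellemazhindu', '@abimooresxox', '@tinkreadingxo']
print(align_handles(names_sample, scraped_sample, {'Amber': '@amberwse'}))
```

A `None` in the result would signal that the page had fewer handles than expected, which is easier to spot than a silently shifted list.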

Let's print out our new lists to verify that the insertions were completed correctly:

In [48]:
for index, name in enumerate(first_names):
  print(index, name, insta_urls[index], insta_handles[index])

0 Molly https://www.instagram.com/mollygracemarsh/?hl=en @mollygracemarsh
1 Danielle https://www.instagram.com/daniellemazhindu/ @daniellemazhindu
2 Abi https://www.instagram.com/abimooresxox/?hl=en @abimooresxox
3 Amber https://www.instagram.com/amberwse/ @amberwse
4 Tink https://www.instagram.com/tinkreadingxo/?hl=en-gb @tinkreadingxo
5 Gabby https://www.instagram.com/gabriellejefferyy/ @gabriellejefferyy
6 Kodie https://www.instagram.com/kodiemurphy/?hl=en @kodiemurphy
7 Ouzy https://www.instagram.com/ouzysee_/ @ouzysee_
8 Zachary https://www.instagram.com/z_ashford/?hl=en @z_ashford
9 Elom https://www.instagram.com/_truegains/?hl=en @_truegains
10 Lochan https://www.instagram.com/lochan_nowacki/ @lochan_nowacki
11 Benjamin https://www.instagram.com/benjaminn_noell/?hl=en @benjaminn_noell
12 Kady https://www.instagram.com/kadymcdermott/?hl=en @kadymcdermott
13 Montel https://www.instagram.com/montelmckenzie/ @montelmckenzie
14 Scott https://www.instagram.com/scottvds17/ @scottvds17


###Manually creating a list for "Sex"

In [49]:
sex = ['F',
       'F',
       'F',
       'F',
       'F',
       'F',
       'M',
       'M',
       'M',
       'M',
       'M',
       'M',
       'F',
       'M',
       'M',
       'F',
       'M',
       'F',
       'M',
       'M',
       'F',
       'F',
       'M',
       'F',
       'M',
       'F',
       'M',
       'F',
       'F',
       'M']

##Putting it all together in a pandas dataframe

Since we saved everything in lists, we can easily construct a pandas dataframe to hold all our information.

In [50]:
islanders_df=pd.DataFrame({'First_Name': first_names,
                           'Last_Name': last_names,
                           'Sex': sex,
                           'Insta_URLs': insta_urls,
                           'Insta_Handles': insta_handles})
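
`pd.DataFrame` raises a `ValueError` when the lists in the dict have different lengths, so a quick check before building the frame can point at the list that is off. This is a minimal sketch in plain Python; `check_equal_lengths` is a hypothetical helper, not part of pandas.

```python
def check_equal_lengths(**columns):
    """Raise ValueError listing each column's length if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Illustrative call with short stand-in lists
print(check_equal_lengths(First_Name=['Molly', 'Danielle'], Sex=['F', 'F']))
```

In this notebook the call would take `first_names`, `last_names`, `sex`, `insta_urls`, and `insta_handles` as keyword arguments.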

In [51]:
islanders_df

Unnamed: 0,First_Name,Last_Name,Sex,Insta_URLs,Insta_Handles
0,Molly,Marsh,F,https://www.instagram.com/mollygracemarsh/?hl=en,@mollygracemarsh
1,Danielle,Mazhindu,F,https://www.instagram.com/daniellemazhindu/,@daniellemazhindu
2,Abi,Moores,F,https://www.instagram.com/abimooresxox/?hl=en,@abimooresxox
3,Amber,Wise,F,https://www.instagram.com/amberwse/,@amberwse
4,Tink,Reading,F,https://www.instagram.com/tinkreadingxo/?hl=en-gb,@tinkreadingxo
5,Gabby,Jeffery,F,https://www.instagram.com/gabriellejefferyy/,@gabriellejefferyy
6,Kodie,Murphy,M,https://www.instagram.com/kodiemurphy/?hl=en,@kodiemurphy
7,Ouzy,See,M,https://www.instagram.com/ouzysee_/,@ouzysee_
8,Zachary,Ashford,M,https://www.instagram.com/z_ashford/?hl=en,@z_ashford
9,Elom,Ahlijah-Wilson,M,https://www.instagram.com/_truegains/?hl=en,@_truegains


##Exporting the result as a CSV:

This assumes that you're running this notebook in Google Colab.

Download CSV to computer

In [52]:
from google.colab import files

islanders_df.to_csv('islanders_S23.csv', encoding = 'utf-8-sig')
files.download('islanders_S23.csv')
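
Outside Colab, the `google.colab` import fails. A guarded version, sketched below, only triggers the browser download when that module is available; `download_if_colab` is a hypothetical helper name.

```python
def download_if_colab(path):
    """Trigger a Colab browser download for `path`; return False outside Colab."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import files
    except ImportError:
        return False
    files.download(path)
    return True
```

Calling `download_if_colab('islanders_S23.csv')` after `to_csv` leaves the file in the working directory either way.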
