In this notebook, we collect as many links from the website of letsintern as we can. We use BeautifulSoup to scrape the data.

Also, note that the way the links are extracted is a little unusual in places. That is only because I am still learning scraping and sometimes have to hack my way through.

We collect links from letsintern in the following way:
1. We go to its landing page and collect the links to all the internship categories.
2. We go to each of these categories and collect the 10-14 internship profile links shown there. More can't be collected because the page uses a button that has to be clicked to display more links, and the URL stays the same when the next batch of links is displayed. I haven't figured out a way to deal with this yet.
3. We go to all the internship profiles and collect information from them. We also collect the other links on each profile page (the ones recommended on the side of the page). Note that these collected links point to company pages where different internship profiles are listed, so after going to each of these links, the internship profile links have to be extracted again.

This way we will get enough internship profiles to start our work with.

_Note: "complete links" is used below to mean full URLs that include the domain, 
e.g. www.letsintern.com/internships_
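
The functions below build complete links by gluing a domain string onto each `href`. As a small sketch of an alternative, the standard library's `urllib.parse.urljoin` does the same job and also leaves already-complete links alone. `BASE_URL` and `to_complete_link` are hypothetical names used only for this illustration; the base URL is assumed from the landing page requested in `collect_tag_links()`.

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the landing page the notebook requests.
BASE_URL = "https://www.letsintern.com/"


def to_complete_link(href, base=BASE_URL):
    """Turn a relative href into a complete link; absolute links pass through unchanged."""
    return urljoin(base, href)


print(to_complete_link("/internships"))
print(to_complete_link("https://www.letsintern.com/internships"))
```

Both calls print the same complete link, so a check such as `'letsintern' not in href` would no longer be needed to avoid doubling the domain.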

In [1]:
# importing libs
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [21]:
# defining functions
def collect_tag_links():
    '''
    collects all complete links to different internship category pages such as:
    internships in Delhi, Finance internships, Tech internships, internships in Chandigarh etc.
    
    OUTPUT:
    tag_links - all complete links to the category pages present on www.letsintern.com 
    '''
    tag_links = []
    
    response = requests.get('https://www.letsintern.com/')
    # name the parser explicitly so bs4 does not warn or silently pick a different one
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # each category sits in an <li class="col-sm-4"> that holds one or more <a> tags
    tags = soup.find_all('li',attrs = {'class':'col-sm-4'})
    for i in tags:
        hrefs = i.find_all('a')
        hrefs = ['http://letsintern.com' + j['href'] for j in hrefs]
        tag_links.extend(hrefs)
        
    return tag_links

def find_links_main(links_list):
    '''
    finds links to specific internship pages from pages where different internship profiles are listed. 
    Eg : www.letsintern.com/interships/IT-internships
    
    INPUT:
    links_list - a list of complete links that refer to different category pages
    
    OUTPUT:
    collected_links - a de-duplicated list of links collected from all the pages in links_list 
    '''
    collected_links = []
    n = 1
    
    for link in links_list:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        links = soup.find_all('div', attrs = {'class':'job-title'})
        
        # hrefs that already contain the domain are skipped; only relative ones get the prefix
        collected_links.extend(['http://letsintern.com' + i.a['href'] for i in links if 'letsintern' not in i.a['href']])
        
        # progress as a fraction of the pages visited so far
        print(n/len(links_list))
        n += 1
    collected_links = list(set(collected_links))
    return collected_links

def find_links_internship(links_list):
    '''
    finds the links present on an internship profile page. These links point to a company 
    page where more links are posted. 
    Also, the links collected will be similar to the internship on that page as these themselves are the 
    recommendations by letsintern.
    
    INPUT:
    links_list - a list of complete links that refer to specific internship pages
    
    OUTPUT:
    collected_links - a de-duplicated list of company-page links collected from all the 
                      pages in links_list.
                      
    '''
    collected_links = []
    n = 1
    
    for link in links_list:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        links = soup.find_all('div',attrs={'class':"col-sm-9 col-xs-9"})
        
        collected_links.extend(['http://letsintern.com' + i.a['href'] for i in links])
        
        # progress as a fraction of the pages visited so far
        print(n/len(links_list))
        n += 1
    collected_links = list(set(collected_links))
    return collected_links

def find_links_company(links_list):
    '''
    finds the links to different internship profiles present on an internship company page. 
    
    INPUT:
    links_list - a list of complete links that refer to specific company pages
    
    OUTPUT:
    collected_links - a de-duplicated list of internship profile links collected from all 
                      the pages in links_list.
    '''
    collected_links = []
    n = 1

    for link in links_list:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')

        links = soup.find_all('div',attrs = {'class':'job-title'})

        for i in links:
            # some job-title divs have no <a> child or no href; print and skip those
            if i.a is None or not i.a.get('href'):
                print(i)
                continue
            collected_links.append('http://letsintern.com' + i.a['href'])
        print(n/len(links_list))
        n += 1
    return list(set(collected_links))

def extract_data(links_list):
    '''
    extracts all the relevant data needed from each of the links and returns a 
    dataframe containing all that information (the line that saves it to csv is commented out).
    
    INPUT:
    links_list - a list of complete links that refer to specific internship pages
    
    OUTPUT:
    df - a dataframe with one row per link that could be scraped and the extracted
         information as columns
    
    '''
    kept_links = []  # links whose pages exist; keeps 'href' aligned with the other columns
    job_title = []
    company_name = []
    job_loc = []
    details = []
    category = []
    compensation = []
    start = []
    end = []
    skills = []
    n = 1
    
    for link in links_list:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')
        print(n/len(links_list))
        n += 1

        try:
            # many job titles were not given as the pages didn't exist themselves
            job_title.append(soup.find_all('div',attrs={'class':'job-title'})[0].text)
        except IndexError:
            # skip this link; popping it from links_list while looping would also skip the next one
            print(link)
            continue
        kept_links.append(link)
        try:
            company_name.append(soup.find_all('div', attrs ={'class':'company-name'})[0].text)
        except IndexError:
            print(soup.find_all('div', attrs ={'class':'company-name'}))
            company_name.append('no company found')
        try:
            job_loc.append(soup.find_all('div', attrs ={'class':'job-locations'})[0].text)
        except IndexError:
            print(soup.find_all('div', attrs ={'class':'job-locations'}))
            job_loc.append('no job location found')
        try:
            details.append(soup.find_all('div', attrs ={'class':'details-section fixht'})[0].text)
        except IndexError:
            print(soup.find_all('div', attrs ={'class':'details-section fixht'}))
            details.append('no details found')
        try:
            category.append(soup.find_all('a', attrs= {'title':'Internship Category'})[0].text)
        except IndexError:
            print(soup.find_all('a', attrs= {'title':'Internship Category'}))
            category.append('no category found')
        try:
            compensation.append(soup.find_all('a', attrs= {'title':'Compensation Type'})[0].text)
        except IndexError:
            print(soup.find_all('a', attrs= {'title':'Compensation Type'}))
            compensation.append('no compensation found')
        try:
            start.append(soup.find_all('li', attrs = {'title':'Start Date'})[0].text)
        except IndexError:
            print(soup.find_all('li', attrs = {'title':'Start Date'}))
            start.append('no start date found')
        try:
            end.append(soup.find_all('li', attrs = {'title':'End Date'})[0].text)
        except IndexError:
            print(soup.find_all('li', attrs = {'title':'End Date'}))
            end.append('no end date found')
        try:
            skills.append(soup.find_all('div', attrs = {'id':'skills-required'})[0].text)
        except IndexError:
            print(soup.find_all('div', attrs = {'id':'skills-required'}))
            skills.append('no skills found')
    
    df = pd.DataFrame({'href':kept_links, 'job_title':job_title, 'company_name':company_name, 'job_loc':job_loc
                      ,'details':details, 'category':category, 'compensation':compensation, 'start':start
                      ,'end':end, 'skills':skills})
    #df.to_csv('../data/information_from_links.csv')
    return df
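
`extract_data()` keeps one list per column, so every list has to stay the same length even when a page is skipped. A possible alternative, sketched here under assumptions, is to build one dict per page and collect the skipped links separately. `build_records` and the `parse_page` callable are hypothetical names for this illustration only; `parse_page` stands in for the requests and BeautifulSoup work and is expected to return `None` for a page that does not exist.

```python
def build_records(links, parse_page):
    """Return (records, skipped): one dict per scraped page, plus the links that failed."""
    records = []
    skipped = []
    for link in links:
        fields = parse_page(link)  # hypothetical parser; None means the page was missing
        if fields is None:
            skipped.append(link)
            continue
        # the link and its fields live in the same dict, so they cannot drift apart
        records.append({"href": link, **fields})
    return records, skipped


# stand-in parser used only to show the shape of the result
def fake_parse(link):
    if link.endswith("/dead"):
        return None
    return {"job_title": "title for " + link}


records, skipped = build_records(["a/live", "a/dead", "b/live"], fake_parse)
print(len(records), skipped)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)`.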

#### Our approach to this would be as follows:
1. Call `collect_tag_links()` to collect all the main internship listing pages.
2. Call `find_links_main()` on all of these. Save the collected links separately.
3. Call `find_links_internship()` on all of the links collected in the above step. Save these links separately.
4. Call `find_links_company()` on the above links.
5. Take the `list(set())` of the links from steps 2 and 4 combined.

These steps will give us all the links we need to scrape information from.
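
The steps above can be sketched as one function. This is only an illustration of the flow, not how the cells below are organised: `collect_all_links` is a hypothetical name, and the four step functions are passed in as arguments so the sketch can be tried with stand-ins instead of hitting the website.

```python
def collect_all_links(get_tags, get_main, get_internship, get_company):
    """Chain the four link-collection steps and return the unique profile links, sorted."""
    tag_links = get_tags()                            # step 1: category pages
    profile_links = get_main(tag_links)               # step 2: profiles listed per category
    company_links = get_internship(profile_links)     # step 3: company pages recommended on profiles
    more_profile_links = get_company(company_links)   # step 4: profiles listed per company
    # step 5: union of the links from steps 2 and 4, without duplicates
    return sorted(set(profile_links) | set(more_profile_links))


# stand-in step functions, used only to show the data flow
demo = collect_all_links(
    lambda: ["cat1"],
    lambda tags: ["p1", "p2"],
    lambda profiles: ["c1"],
    lambda companies: ["p2", "p3"],
)
print(demo)
```

With the notebook's own functions the call would be `collect_all_links(collect_tag_links, find_links_main, find_links_internship, find_links_company)`.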

In [3]:
tag_links = collect_tag_links()

In [4]:
saved_links = find_links_main(tag_links)

0.03571428571428571
0.07142857142857142
0.10714285714285714
0.14285714285714285
0.17857142857142858
0.21428571428571427
0.25
0.2857142857142857
0.32142857142857145
0.35714285714285715
0.39285714285714285
0.42857142857142855
0.4642857142857143
0.5
0.5357142857142857
0.5714285714285714
0.6071428571428571
0.6428571428571429
0.6785714285714286
0.7142857142857143
0.75
0.7857142857142857
0.8214285714285714
0.8571428571428571
0.8928571428571429
0.9285714285714286
0.9642857142857143
1.0


In [5]:
saved_links_1 = find_links_internship(saved_links)

0.0064516129032258064
0.012903225806451613
0.01935483870967742
0.025806451612903226
0.03225806451612903
0.03870967741935484
0.04516129032258064
0.05161290322580645
0.05806451612903226
0.06451612903225806
0.07096774193548387
0.07741935483870968
0.08387096774193549
0.09032258064516129
0.0967741935483871
0.1032258064516129
0.10967741935483871
0.11612903225806452
0.12258064516129032
0.12903225806451613
0.13548387096774195
0.14193548387096774
0.14838709677419354
0.15483870967741936
0.16129032258064516
0.16774193548387098
0.17419354838709677
0.18064516129032257
0.1870967741935484
0.1935483870967742
0.2
0.2064516129032258
0.2129032258064516
0.21935483870967742
0.22580645161290322
0.23225806451612904
0.23870967741935484
0.24516129032258063
0.25161290322580643
0.25806451612903225
0.2645161290322581
0.2709677419354839
0.27741935483870966
0.2838709677419355
0.2903225806451613
0.2967741935483871
0.3032258064516129
0.3096774193548387
0.3161290322580645
0.3225806451612903
0.32903225806451614
0.33548

In [8]:
saved_links_2 = find_links_company(saved_links_1)

0.002785515320334262
0.005571030640668524
0.008356545961002786
0.011142061281337047
0.013927576601671309
0.016713091922005572
0.019498607242339833
0.022284122562674095
0.025069637883008356
0.027855153203342618
0.03064066852367688
0.033426183844011144
0.036211699164345405
0.03899721448467967
0.04178272980501393
0.04456824512534819
0.04735376044568245
0.05013927576601671
0.052924791086350974
0.055710306406685235
0.0584958217270195
0.06128133704735376
0.06406685236768803
0.06685236768802229
0.06963788300835655
0.07242339832869081
0.07520891364902507
0.07799442896935933
0.0807799442896936
0.08356545961002786
0.08635097493036212
0.08913649025069638
0.09192200557103064
0.0947075208913649
0.09749303621169916
0.10027855153203342
0.10306406685236769
0.10584958217270195
0.10863509749303621
0.11142061281337047
0.11420612813370473
0.116991643454039
0.11977715877437325
0.12256267409470752
0.12534818941504178
0.12813370473537605
0.1309192200557103
0.13370473537604458
0.13649025069637882
0.1392757660

0.9805013927576601
0.9832869080779945
0.9860724233983287
0.9888579387186629
0.9916434540389972
0.9944289693593314
0.9972144846796658
1.0


In [9]:
final_links = list(set(saved_links_2) | set(saved_links))

In [10]:
len(final_links)

700

The run above gives 700 unique links. An earlier run gave 665, and that's the data I have worked on.

_Note: this number can change depending on when you run it because the website keeps getting updated with new internships_
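
Because the count changes between runs, it can help to save the collected links so that later work uses the same set. The sketch below is one possible way to do that with the standard library; `save_links_snapshot`, `load_links_snapshot` and the `links_<date>.json` file name are hypothetical and not part of the original notebook.

```python
import json
from datetime import date
from pathlib import Path


def save_links_snapshot(links, directory="."):
    """Write the sorted, de-duplicated links to links_<YYYY-MM-DD>.json and return the path."""
    path = Path(directory) / f"links_{date.today().isoformat()}.json"
    path.write_text(json.dumps(sorted(set(links)), indent=2))
    return path


def load_links_snapshot(path):
    """Read a snapshot written by save_links_snapshot back into a list."""
    return json.loads(Path(path).read_text())
```

Calling `save_links_snapshot(final_links)` right after the links are collected would record which set of links a given dataframe came from.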

In [20]:
# extract the data from all the links
df = extract_data(final_links)

0.00145985401459854
0.00291970802919708
0.004379562043795621
0.00583941605839416
0.0072992700729927005
0.008759124087591242
0.010218978102189781
0.01167883211678832
0.013138686131386862
0.014598540145985401
0.016058394160583942
0.017518248175182483
0.01897810218978102
0.020437956204379562
0.021897810218978103
0.02335766423357664
0.024817518248175182
0.026277372262773723
0.027737226277372264
0.029197080291970802
0.030656934306569343
0.032116788321167884
0.033576642335766425
0.035036496350364967
0.0364963503649635
0.03795620437956204
0.03941605839416058
0.040875912408759124
0.042335766423357665
0.043795620437956206
0.04525547445255475
0.04671532846715328
0.04817518248175182
0.049635036496350364
0.051094890510948905
0.052554744525547446
0.05401459854014599
0.05547445255474453
0.05693430656934306
0.058394160583941604
0.059854014598540145
0.061313868613138686
0.06277372262773723
0.06423357664233577
0.06569343065693431
0.06715328467153285
0.06861313868613139
0.07007299270072993
0.07153284671

0.6160583941605839
0.6175182481751825
0.618978102189781
0.6204379562043796
0.621897810218978
0.6233576642335766
0.6248175182481752
0.6262773722627737
0.6277372262773723
0.6291970802919709
0.6306569343065693
0.6321167883211679
0.6335766423357664
0.635036496350365
0.6364963503649635
0.637956204379562
0.6394160583941606
0.6408759124087591
0.6423357664233577
0.6437956204379562
0.6452554744525547
0.6467153284671533
0.6481751824817519
0.6496350364963503
0.6510948905109489
0.6525547445255474
0.654014598540146
0.6554744525547446
0.656934306569343
0.6583941605839416
0.6598540145985401
0.6613138686131387
0.6627737226277373
0.6642335766423357
0.6656934306569343
0.6671532846715329
0.6686131386861314
0.67007299270073
0.6715328467153284
0.672992700729927
0.6744525547445256
0.6759124087591241
0.6773722627737226
0.6788321167883211
0.6802919708029197
0.6817518248175183
0.6832116788321168
0.6846715328467153
0.6861313868613139
0.6875912408759124
0.689051094890511
0.6905109489051094
0.691970802919708
0.69