_Note: "complete links" is used below to refer to full links,
e.g. www.letsintern.com/internships_

To gather our data, we first collect links from the letsintern website. We then visit each of these links, extract the information listed below, and store it in lists. Finally, we build a dataframe from the data for all the links and save it; this dataframe will be used later to build the recommendation system.

The following features are needed:

* Internship title
* Company name
* Tagged profile
* Start and end date
* Compensation
* Skills required
* About internship/ role and responsibilities
* Location

Also, note that the way the links are extracted is a little unusual. That is only because I am still not perfect at scraping and sometimes have to hack my way out.
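As a side note on building complete links: the functions below prepend `'http://letsintern.com'` to each href by hand. A minimal sketch of an alternative, assuming the hrefs are ordinary relative or absolute URLs, is to resolve them with the standard library's `urljoin`. The helper name `to_complete_link` and the `BASE_URL` constant are hypothetical and are not used elsewhere in this notebook.

```python
from urllib.parse import urljoin

# Assumed base URL; the scraping functions below use 'http://letsintern.com' as prefix.
BASE_URL = "https://www.letsintern.com/"

def to_complete_link(href, base=BASE_URL):
    """Resolve a relative or absolute href against the base URL."""
    return urljoin(base, href)

# A relative href gets the base prepended; an absolute one is left unchanged.
print(to_complete_link("/internships/IT-internships"))
print(to_complete_link("https://www.letsintern.com/internships"))
```

With this approach there would be no need to check whether an href already contains the domain before prefixing it.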

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [11]:
def collect_tag_links():
    '''
    Collect all complete links to the different internship category pages, such as:
    internships in Delhi, Finance internships, Tech internships, internships in Chandigarh etc.
    
    OUTPUT:
    tag_links - all of the category links present on the www.letsintern.com page
    '''
    tag_links = []
    response = requests.get('https://www.letsintern.com/')
    # Name the parser explicitly so that bs4 does not have to guess one
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each category link sits inside an <li class="col-sm-4"> element
    tags = soup.find_all('li', attrs={'class': 'col-sm-4'})
    for tag in tags:
        hrefs = ['http://letsintern.com' + a['href'] for a in tag.find_all('a')]
        tag_links.extend(hrefs)
    return tag_links

In [91]:
def find_links_main(links_list):
    '''
    Find links to specific internship pages from the category pages, i.e. pages
    where different internships are listed. E.g.: www.letsintern.com/internships/IT-internships
    
    INPUT:
    links_list - a list of complete links that refer to different category pages
    
    OUTPUT:
    collected_links - a list of unique links collected from all the pages in links_list
    '''
    collected_links = []
    n = 1
    for link in links_list:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.find_all('div', attrs={'class': 'job-title'})
        # Skip hrefs that already contain the domain; prefix the relative ones
        collected_links.extend(['http://letsintern.com' + i.a['href'] for i in links if 'letsintern' not in i.a['href']])
        print(n/len(links_list))  # fraction of pages processed so far
        n += 1
    collected_links = list(set(collected_links))
    return collected_links

In [58]:
def find_links_internship(links_list):
    '''
    Find the links present on an internship profile page. These links point to a company
    page where more links are posted.
    The collected links will be similar to the internship on that page, as they are themselves
    the recommendations made by letsintern.
    
    INPUT:
    links_list - a list of complete links that refer to specific internship pages
    
    OUTPUT:
    collected_links - a list of unique links collected from all the pages in links_list
    '''
    collected_links = []
    n = 1
    for link in links_list:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.find_all('div', attrs={'class': "col-sm-9 col-xs-9"})
        collected_links.extend(['http://letsintern.com' + i.a['href'] for i in links])
        print(n/len(links_list))  # fraction of pages processed so far
        n += 1
    collected_links = list(set(collected_links))
    return collected_links
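
The functions in this notebook report progress by printing `n/len(links_list)`, which yields long decimal fractions. A minimal sketch of a more readable alternative is shown here; the helper name `progress` is hypothetical and the page names are placeholders.

```python
def progress(done, total):
    """Return a short progress string such as '1/4 (25.0%)'."""
    return f"{done}/{total} ({done / total:.1%})"

# Placeholder page names standing in for a list of links.
pages = ["page-a", "page-b", "page-c", "page-d"]
for n, page in enumerate(pages, start=1):
    print(progress(n, len(pages)))
```

Using `enumerate(..., start=1)` also removes the need to maintain the counter `n` by hand.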

In [179]:
def find_links_company(links_list):
    '''
    Find the internship links present on a company page.
    
    INPUT:
    links_list - a list of complete links that refer to specific company pages
    
    OUTPUT:
    collected_links - a list of unique links collected from all the pages in links_list
    '''
    collected_links = []
    n = 1
    for link in links_list:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.find_all('div', attrs={'class': 'job-title'})
        collected_links.extend(['http://letsintern.com' + i.a['href'] for i in links])
        print(n/len(links_list))  # fraction of pages processed so far
        n += 1
    return list(set(collected_links))
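
The three link-finding functions above share one shape: fetch each page, pull out some hrefs, prefix them, and de-duplicate. A minimal sketch of that shared loop is given here, with the fetching and the href extraction passed in as arguments so the loop can be tried without network access. The name `collect_links`, its parameters, and the `fake_site` data are hypothetical stand-ins, not part of the notebook's actual code.

```python
def collect_links(pages, fetch, extract_hrefs, prefix="http://letsintern.com"):
    """Fetch each page, extract its hrefs, and return the unique complete links.

    fetch(page) returns the page's HTML text.
    extract_hrefs(html) returns an iterable of relative hrefs.
    """
    collected = set()
    for page in pages:
        html = fetch(page)
        collected.update(prefix + href for href in extract_hrefs(html))
    return sorted(collected)

# Stand-ins for requests.get(...).text and the BeautifulSoup lookup (made-up data).
fake_site = {"page1": "/a /b", "page2": "/b /c"}
links = collect_links(fake_site, fetch=fake_site.get, extract_hrefs=str.split)
print(links)
```

In real use, `fetch` would wrap `requests.get` and `extract_hrefs` would wrap the relevant `soup.find_all` call for each page type.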

In [264]:
def extract_data(links_list):
    '''
    this function extracts all the relevant data needed from each of the links and returns and saves a 
    dataframe containing all that information
    
    INPUT:
    links_list - a list of complete links that refer to specific internship pages
    
    OUTPUT:
    df - a dataframe with the rows as the links and the columns as the information extracted
    
    '''
    job_title = []
    company_name = []
    job_loc = []
    details = []
    category = []
    compensation =[]
    start = []
    end = []
    skills = []
    n = 1
    
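    for link in list(links_list):  # iterate over a copy, since failed links are removed below
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')

        try:
            job_title.append(soup.find_all('div',attrs={'class':'job-title'})[0].text)
        except IndexError:
            # No job title on the page: report the link and drop it so the columns stay aligned
            print(link)
            links_list.remove(link)
            continue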
        company_name.append(soup.find_all('div', attrs ={'class':'company-name'})[0].text)

        job_loc.append(soup.find_all('div', attrs ={'class':'job-locations'})[0].text)

        details.append(soup.find_all('div', attrs ={'class':'details-section fixht'})[0].text)

        category.append(soup.find_all('a', attrs= {'title':'Internship Category'})[0].text)

        compensation.append(soup.find_all('a', attrs= {'title':'Compensation Type'})[0].text)

        start.append(soup.find_all('li', attrs = {'title':'Start Date'})[0].text)

        end.append(soup.find_all('li', attrs = {'title':'End Date'})[0].text)

        skills.append(soup.find_all('div', attrs = {'id':'skills-required'})[0].text)
        
        print(n/len(links_list))
        n+=1
        
    df = pd.DataFrame({'href':links_list, 'job_title':job_title, 'company_name':company_name, 'job_loc':job_loc
                      ,'details':details, 'category':category, 'compensation':compensation, 'start':start
                      ,'end':end, 'skills':skills})
    df.to_csv('../data/information_from_links.csv')
    return df
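
`extract_data` keeps one list per column, so every list must end up with the same length for the dataframe to be built. A minimal sketch of an alternative layout, assuming one dict per link, is shown here using only the standard library's `csv` module; `pd.DataFrame` also accepts a list of dicts. The helper name `rows_to_csv`, the reduced `FIELDS` list, and the sample rows are hypothetical.

```python
import csv
import io

# A subset of the columns, for brevity.
FIELDS = ["href", "job_title", "company_name"]

def rows_to_csv(rows, fields=FIELDS):
    """Write a list of per-link dicts as CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up records; the second one lacks a company name.
rows = [
    {"href": "link-1", "job_title": "Title A", "company_name": "Company A"},
    {"href": "link-2", "job_title": "Title B"},
]
print(rows_to_csv(rows))
```

Because each record carries its own link, a field missing from one page leaves a gap in that row only and cannot shift the other columns.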

#### Our approach is as follows:
1. Call `collect_tag_links()` to collect all the main internship listing pages.
2. Call `find_links_main()` on all of these. Save the collected links separately.
3. Call `find_links_internship()` on all of the links collected in the above step. Save these links separately.
4. Call `find_links_company()` on the above links.
5. Take the `list(set())` of the links from steps 2 and 4.

These steps will give us all the links we need to scrape information from.
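
Step 5 amounts to a set union of two link lists. A minimal sketch with placeholder link names is shown here; the helper `merge_link_lists` is hypothetical, and it sorts the result only to make the output order reproducible.

```python
def merge_link_lists(*link_lists):
    """Return the sorted union of any number of link lists."""
    merged = set()
    for links in link_lists:
        merged.update(links)
    return sorted(merged)

# Placeholder names standing in for the links from steps 2 and 4.
step2_links = ["link-a", "link-b"]
step4_links = ["link-b", "link-c"]
final = merge_link_lists(step2_links, step4_links)
print(final)
```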

In [94]:
tag_links = collect_tag_links()

In [95]:
saved_links = find_links_main(tag_links)

0.03571428571428571
0.07142857142857142
...
1.0


In [100]:
saved_links_1 = find_links_internship(saved_links)

0.006622516556291391
0.013245033112582781
...

In [162]:
saved_links_2 = find_links_company(saved_links_1)

0.0028653295128939827
0.0057306590257879654
...

In [175]:
final_links = list(set(saved_links_2) | set(saved_links))

In [176]:
len(final_links)

665

Thus we have 665 unique links from which to extract the information we need.

In [265]:
df = extract_data(final_links)

0.0015313935681470138
0.0030627871362940277
...
