<h1>Web Scraping Indeed for Key Data Science Job Skills </h1>

# Cleaning a Website

In [5]:
from bs4 import BeautifulSoup # For HTML parsing
import urllib.request # Website connections
import re # Regular expressions
from time import sleep # To prevent overwhelming the server between connections
from collections import Counter # Keep track of our term counts
from nltk.corpus import stopwords # Filter out stopwords, such as 'the', 'or', 'and'
import pandas as pd # For converting results to a dataframe and bar chart plots
%matplotlib inline

In [6]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amey.naik\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

Now create our first website parsing function

In [7]:
def extract_job_title(job_div, job_post):    
    for title in job_div.find(name="h3", attrs={"class":"jobsearch-JobInfoHeader-title"}):
        if isinstance(title, bs4.element.NavigableString):
            job_post.append(title.text.strip())

In [8]:
def extract_company(job_div, job_post):    
    for company in job_div.find(name="h4", attrs={"class":"jobsearch-CompanyReview--heading"}):
        if isinstance(company, bs4.element.NavigableString):
            job_post.append(company.text.strip())

In [9]:
def text_cleaner(website, job_post):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = urllib.request.urlopen(website).read() # Connect to the job posting
    except: 
        return   # Need this in case the website isn't there anymore or some other weird connection problem 
    
    soup_obj = BeautifulSoup(site) # Get the html from the site
    
    for script in soup_obj(["script", "style"]):
        script.extract() # Remove these two elements from the BS4 object
    
    extract_job_title(soup_obj, job_post)
    extract_company(soup_obj, job_post)

    text = soup_obj.get_text() # Get the text from this
    
        
    
    lines = (line.strip() for line in text.splitlines()) # break into lines
    
        
        
    chunks = (phrase.strip() for line in lines for phrase in line.split("  ")) # break multi-headlines into a line each
    
    def chunk_space(chunk):
        chunk_out = chunk + ' ' # Need to fix spacing issue
        return chunk_out  
        
    
    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8') # Get rid of all blank lines and ends of line
        
        
    # Now clean out all of the unicode junk (this line works great!!!)
        
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore') # Need this as some websites aren't formatted
    except:                                                            # in a way that this works, can occasionally throw
        return                                                         # an exception
       
        
    text = re.sub("[^a-zA-Z.+3]"," ", text.decode('utf-8'))  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                                # Also include + for C++
        
       
    text = text.lower().split()  # Go to lower case and split them apart
        
        
    stop_words = set(stopwords.words("english")) # Filter out any stop words
    text = [w for w in text if not w in stop_words]
        
               
    text = list(set(text)) # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                            # or not on the website)
    job_post.append(text)    
    return text

In [8]:
sample = text_cleaner('https://www.indeed.co.in/viewjob?jk=76a15d792ddb6cc0&tk=1ebvkckfp154h000&from=serp&vjs=3')
print (sample[:20]) 

['centercookies', 'major', 'python', 'existence', 'post', 'indeedabouthelp', 'realm', 'numpy', 'hbase.should', 'l', 'tools', 'diverse', 'sustainable', 'workforce.', 'credit', 'mandatory.proficiency', 'sengage', 'missions', 'ou', 'entreprise.']


As you can see in the code above, a lot of cleaning for the raw html is necessary to get the final terms we are looking for. It extracts the relevant portions of the html, gets the text, removes blank lines and line endings, removes unicode, and filters with regular expressions to include only words. To see what the final result looks like, let’s try calling this function on a sample job posting. 

### Get job title

In [2]:
def extract_job_title(job_div):    
    for title in job_div.find(name="h3", attrs={"class":"jobsearch-JobInfoHeader-title"}):
        if isinstance(title, bs4.element.NavigableString):
            job_post.append(title.text.strip())

### Get company name

In [1]:
def extract_company(job_div):    
    for company in job_div.find(name="h4", attrs={"class":"jobsearch-CompanyReview--heading"}):
        if isinstance(company, bs4.element.NavigableString):
            job_post.append(title.text.strip())

# Accessing the Job Postings

In [9]:
def skills_info_ind(city = None, state = None):
    '''
    This function will take a desired city/state and look for all new job postings
    on Indeed.com. It will crawl all of the job postings and keep track of how many
    use a preset list of typical data science skills. The final percentage for each skill
    is then displayed at the end of the collation. 
        
    Inputs: The location's city and state. These are optional. If no city/state is input, 
    the function will assume a national search (this can take a while!!!).
    Input the city/state as strings, such as skills_info('Chicago', 'IL').
    Use a two letter abbreviation for the state.
    
    Output: A bar chart showing the most commonly desired skills in the job market for 
    a data scientist. 
    '''
        
    final_job = 'data+scientist' # searching for data scientist exact fit("data scientist" on Indeed search)
    
    # Make sure the city specified works properly if it has more than one word (such as San Francisco)
    if city is not None:
        final_city = city.split() 
        final_city = '+'.join(word for word in final_city)
        final_site_list = ['http://www.indeed.co.in/jobs?q=%22', final_job, '%22&l=', final_city,
                    '%2C+', state] # Join all of our strings together so that indeed will search correctly
    else:
        final_site_list = ['http://www.indeed.co.in/jobs?q="', final_job, '"']

    final_site = ''.join(final_site_list) # Merge the html address together into one string
    print (final_site)
    
    base_url = 'http://www.indeed.co.in'
    
    
    try:
        html = urllib.request.urlopen(final_site).read() # Open up the front page of our search first
    except:
        'That city/state combination did not have any jobs. Exiting . . .' # In case the city is invalid
        return
    soup = BeautifulSoup(html) # Get the html from the first page
    
    # Now find out how many jobs there were
    
    num_jobs_area = soup.find(id = 'searchCountPages').string.encode('utf-8') # Now extract the total number of jobs found
                                                                        # The 'searchCount' object has this
    print (num_jobs_area)
    job_numbers = re.findall('\d+', num_jobs_area.decode('utf-8')) # Extract the total jobs found from the search result
    print (job_numbers)
    
    if len(job_numbers) > 1: # Have a total number of jobs greater than 1000
        total_num_jobs = (int(job_numbers[1])*10)
    else:
        total_num_jobs = int(job_numbers[0]) 
    
    city_title = city
    if city is None:
        city_title = 'Nationwide'
        
    print ('There were', total_num_jobs, 'jobs found,', city_title) # Display how many jobs were found
    
    num_pages = total_num_jobs/10 # This will be how we know the number of times we need to iterate over each new
                                      # search result page
    job_descriptions = [] # Store all our descriptions in this list
    
    columns = ["job_title", "company_name", "description"]
    sample_df = pd.DataFrame(columns = columns)
    
    for i in range(1,int(num_pages)+1): # Loop through all of our search result pages
        print ('Getting page', i)
                
        start_num = str(i*10) # Assign the multiplier of 10 to view the pages we want
        current_page = ''.join([final_site, '&start=', start_num])
        # Now that we can view the correct 10 job returns, start collecting the text samples from each
            
        html_page = urllib.request.urlopen(current_page).read() # Get the page
            
        page_obj = BeautifulSoup(html_page) # Locate all of the job links
        job_link_area = page_obj.find(id = 'resultsCol') # The center column on the page where the job postings exist
            
        
        for link in job_link_area.find_all('a'):
            if link.get('href') is not None:
                job_URLS = [base_url + link.get('href')]
         
        job_URLS = [base_url + link.get('href') for link in job_link_area.find_all('a') if link.get('href') is not None ] # Get the URLS for the jobs
        job_URLS = list(filter(lambda x:'clk' in x, job_URLS)) # Now get just the job related URLS
        print (job_URLS)    
        
        for j in range(0,len(job_URLS)):
            #specifying row num for index of job posting in dataframe
            num = (len(sample_df) + 1)
            #creating an empty list to hold the data for each posting
            job_post = []
            
            final_description = text_cleaner(job_URLS[j], job_post)
            sample_df.loc[num] = job_post
            if final_description: # So that we only append when the website was accessed correctly
                job_descriptions.append(final_description)
            sleep(1) #If you have a very fast internet connection you could hit the server a lot! 
    print (job_descriptions)    
    print ('Done with collecting the job postings!')    
    print ('There were', len(job_descriptions), 'jobs successfully found.')
    sample_df.to_csv("job_scrapping_indeed.csv", encoding="utf-8")
    
    doc_frequency = Counter() # This will create a full counter of our terms. 
    [doc_frequency.update(item) for item in job_descriptions] # List comp
    
    # Now we can just look at our final dict list inside doc_frequency
    
    # Obtain our key terms and store them in a dict. These are the key data science skills we are looking for
    
    prog_lang_dict = Counter({'R':doc_frequency['r'], 'Python':doc_frequency['python'],
                    'Java':doc_frequency['java'], 'C++':doc_frequency['c++'],
                    'Ruby':doc_frequency['ruby'],
                    'Perl':doc_frequency['perl'], 'Matlab':doc_frequency['matlab'],
                    'JavaScript':doc_frequency['javascript'], 'Scala': doc_frequency['scala']})
                      
    analysis_tool_dict = Counter({'Excel':doc_frequency['excel'],  'Tableau':doc_frequency['tableau'],
                        'D3.js':doc_frequency['d3.js'], 'SAS':doc_frequency['sas'],
                        'SPSS':doc_frequency['spss'], 'D3':doc_frequency['d3']})  

    hadoop_dict = Counter({'Hadoop':doc_frequency['hadoop'], 'MapReduce':doc_frequency['mapreduce'],
                'Spark':doc_frequency['spark'], 'Pig':doc_frequency['pig'],
                'Hive':doc_frequency['hive'], 'Shark':doc_frequency['shark'],
                'Oozie':doc_frequency['oozie'], 'ZooKeeper':doc_frequency['zookeeper'],
                'Flume':doc_frequency['flume'], 'Mahout':doc_frequency['mahout']})
                
    database_dict = Counter({'SQL':doc_frequency['sql'], 'NoSQL':doc_frequency['nosql'],
                    'HBase':doc_frequency['hbase'], 'Cassandra':doc_frequency['cassandra'],
                    'MongoDB':doc_frequency['mongodb']})
    
    skill_dict = Counter({'Analytics':doc_frequency['analytics'], 'Hadoop':doc_frequency['hadoop'],'ML':doc_frequency['machine'],
                    'Data Mining':doc_frequency['mining'], 'Visualization':doc_frequency['visualization'],'NLP':doc_frequency['language'],'Computer Vision':doc_frequency['vision'],'Deep Learning':doc_frequency['deep'],'Graph Engine':doc_frequency['graph'] })

    domain_dict = Counter({'Aerospace':doc_frequency['aerospace'], 'Automobile':doc_frequency['automobile'],
                    'Banking':doc_frequency['banking'], 'FMCG':doc_frequency['fmcg'],
                    'Retail':doc_frequency['retail'],'Travel':doc_frequency['travel'],'Telecom':doc_frequency['telecom'],'Healthcare':doc_frequency['healthcare'], 'Energy':doc_frequency['energy'],'Education':doc_frequency['education']})                 
               
    #overall_total_skills = prog_lang_dict + analysis_tool_dict + database_dict + skill_dict # Combine our Counter objects
    overall_total_skills = domain_dict # Combine our Counter objects
    print (overall_total_skills)    
    
    columns = ['Term', 'NumPostings']
    final_frame = pd.DataFrame.from_dict(overall_total_skills,orient='index').reset_index() # Convert these terms to a 
                                                                                                # dataframe 
    final_frame = final_frame.rename(columns={'index':'Term', 0:'NumPostings'})
    
    # Change the values to reflect a percentage of the postings 
    
    print (final_frame)
    final_frame.NumPostings = (final_frame.NumPostings)*100/len(job_descriptions) # Gives percentage of job postings 
                                                                                    #  having that term 
    
    # Sort the data for plotting purposes
    
    final_frame.sort_values(by = 'NumPostings', ascending = False, inplace = True)
    
    # Get it ready for a bar plot
    print ('****')    
    print (final_frame)
    print ('Sai')
    state = "Silicon Valley"
    final_plot = final_frame.plot(x = 'Term', kind = 'bar', legend = None, 
                            title = 'Percentage of Data Scientist Job Ads with a Key Skill, ' + state)
        
    final_plot.set_ylabel('Percentage Appearing in Job Ads')
    fig = final_plot.get_figure() # Have to convert the pandas plot object to a matplotlib object
    jd = pd.DataFrame(job_descriptions)    
        
    return jd # End of the function

In [10]:
job_info = skills_info_ind(city = "Hyderabad", state = "Telangana") 

http://www.indeed.co.in/jobs?q=%22data+scientist%22&l=Hyderabad%2C+Telangana
b'\n                    Page 1 of 233 jobs'
['1', '233']
There were 2330 jobs found, Hyderabad
Getting page 1
['http://www.indeed.co.in/rc/clk?jk=de42294fd77fd751&fccid=b3b99a1c02370545&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=36851701560aacbd&fccid=0e04d23687478ad9&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=67e80514a1b944a2&fccid=9536dde6bb34eec9&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b0ba494080022f2d&fccid=03b99c5308afc4a9&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=9a9743e70c2e8cc2&fccid=9e215d88a6b33622&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b22c4aa482cec7c5&fccid=0cdd67e1391d9490&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=a444b30a99d3689e&fccid=633f4ab0397bd53d&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=15900d5dd3058c97&fccid=11d9243527c6025a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=d2ec97a887f4d563&fccid=8bd5ab631f7f9ef3&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=97a6847bb0dad81f&fccid=2b07479

Getting page 7
['http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0AsDgJcjf3QstaB-NhPfcDvitvdOui-gURn6hD2sXcTqCNf98SJ_tS8y2uKQcqVtkG4THHLts8G-Sv3JS-5glwH6IyuIJ8s5HUzZT3oTGAh8WNulgEN34hST4xhxLNIfmsqP2FxBhlSh9t5d1poFqnxg77lFrevQtKfgULnJ4Cp_AaKwI3iNZ6CBPSXLtCdtliu78C6MTYQfNUR_TDmjD33efQZkUs2NfoBdOziuhDK81slXXXDGfIh2FSotldQxacCZxnW1s88DD_WDTvv0O9xASCY7ND1x0jxlBVRsyiO_xIwaHU8ukaUVGLVy1LXEKayEajeWO3iMP_Hzz3jOlwbcF2KuaGEznqbBstgX5vX1aDafjdBmIZov7fQo--IbZAd0iJ_c4AD4OuPWIhm1nTMaBZyhVBzXJgNWRZZ8gOXq96ucpwZ-TkSFS226SJSFQoJ93nYZy8VBv1ocuXux59r7UNaE-l3v8qjFhG4bGso-bCiI3Vs9nlshcTfQgzNMgVALO4bauY1meuTlu97W87cKw8QdUeruZHNULbuSsf5vraOgww4FBakb9qcZ70U9-kZpYzQrYUMZGWnNuCWxOv9PjirjDg4x0tm6MU1_4cuUusFcrx3GL2KmPWoQxXVKq7unTCaZ8XjcGclykQKWIztCx6y6Nvl4lzl4E85rx9q-pVQyu4XeiDJQW0ErE4H1kY=&p=0&fvj=0&vjs=3', 'http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0DGWbMxlqSnT1fiijd83xEf-YAaxqSlo08xiEbCyqV61bonpf0d4kx_xw6JXJ0bRVNGo72Wl38IOvKUaunRlrjN3-dkwyz1hW-lvt1pZ8YK_ANDPsnbEua-eykTVRuA4t9eDv-X63ilEiXighwJCeo_I

Getting page 10
['http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0DGWbMxlqSnT1fiijd83xEf-YAaxqSlo08xiEbCyqV61bonpf0d4kx_xw6JXJ0bRVNGo72Wl38IOvKUaunRlrjN3-dkwyz1hW-lvt1pZ8YK_ANDPsnbEua-eykTVRuA4t9eDv-X63ilEiXighwJCeo_Ixa3SAfc1NYCHpmBocr3u1iB459MTvlQ8Unmxyw-gT8vXxK2v9Mnzp-nA0Q8LYZjsMshSt53QiDgeFKqgB0StCmslrk-vdsInH0Y1jBe2vMDPzFTfl7uYxhIjkCVO64AKASD9Ljgl_hjlQ4c0ZolDqEO_DDU8buUHcNkt04TxRHV2-OkavZ6uq5X11y-Ax5s8lNgU_zxLOV14nX5kDa_X9-YUYr_WtMmA-cMnhI3CguMR29j5OHNgKJd7uNMdrHVTvL6sAnc65aGI3KtLcr49-8zDPKKg5rqsxXTWREVyS8Jf-kuJYYoYRjTBW9WRECp2-Elm52MfF-oeu7axjVj0kZmvDLuFsdB0KZHnA7zBhuMwIjSsjmt-5DNiaRVaxtrnpiHM3gHY5p2YUuhhIifGR2uhOr0_i0LXDYuNtTLqnANSfCPCxtD-XuZunFBz6OBQO3rawxgDThz3e1rd6WC6YpUSH-D9gO2tm-Hg2_GqEAxVgrqSf0fBs9flTstnmJDCTahiOsw0GmTqcXet4ek50vaN7DoVA_EsB5Z67N5zVEo199EBlKbkVD3yWBJccbg_xr1uW1Gz_ekLMwWDwf_YEvkwDS44Y0W4BH5KAhw6p_npGnMx7I3wfcMF4oNWdPjHnVqpI_qirAgP1M8sxUBiDT7pPTGxHOdW5WWNXGjvNhHafBzKwQiCWlARgfcJduFPQjHBM5tbhF0t3xLZ07RVECvEoAGh52TBNsPBp4Fa0WFjNwHXQ8ZM-F1u9G0m6LliCUMvLv7pkK

Getting page 12
['http://www.indeed.co.in/rc/clk?jk=99a0f030d373ec54&fccid=cd1c087ed0fcb7d8&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b9c0ebda601461c5&fccid=21df030fae150acc&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=f4abbbfc2c0b08e6&fccid=d3d3520998346837&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb19692d183dd182&fccid=59b5ebe2be23b284&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=3e060ff4255bc3d5&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=1398147ffaaf489e&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=9432e386f38dcc7c&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=42d486e850cdec94&fccid=e0efd5c38293090b&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb688aa3261ea8b5&fccid=dd616958bd9ddc12&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=4a94dd89a9e4be65&fccid=7ab93833a090100a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=0d18908418825b55&fccid=5d3ed74ca4598964&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=ac9bb1ca7d99f182&fccid=b1e0f66da5c9

Getting page 16
['http://www.indeed.co.in/rc/clk?jk=89792f85ebd84f3b&fccid=d887c830351bb4f7&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=a3727f958a395220&fccid=c35194d01a9e2595&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b26275c56bdcec00&fccid=024f492ec424f077&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=a9e50e47c46ae24b&fccid=be240c643a8631c5&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=516ae06f9806cb23&fccid=8d25321c1defe73f&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=2e7ebc4118084985&fccid=a0d14db4facebe59&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb688aa3261ea8b5&fccid=dd616958bd9ddc12&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=d17839b8e29287a6&fccid=810ca238be51ae86&vjs=3', 'http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0DGWbMxlqSnT1fiijd83xEf-YAaxqSlo08xiEbCyqV61bonpf0d4kx_xw6JXJ0bRVNGo72Wl38IOvKUaunRlrjN3-dkwyz1hW-lvt1pZ8YK_ANDPsnbEua-eykTVRuA4t9eDv-X63ilEiXighwJCeo_Ixa3SAfc1Na55w5TIcko20nfuK6zhxYArOAPQFHbt3qiSvKv2P690PHmYFMoq35ICXQxUNbuWxPxrCegrDRNwebQ0YkMM_Ljet-0mzWI7AEy8pbv-OQ

Getting page 19
['http://www.indeed.co.in/rc/clk?jk=6816c623a400fdad&fccid=734cb5a01ee60f80&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=d952eec5fb55ca34&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=a8d41e03932f4f79&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=0d67b437b8c93106&fccid=17ee461a9a7407a9&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b8a1b25e7807993d&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=efce8ed0257ddff9&fccid=1c5cfc525b841c32&vjs=3', 'http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0AsDgJcjf3QstaB-NhPfcDvitvdOui-gURn6hD2sXcTqCNf98SJ_tS8y2uKQcqVtkHP2y49q9jUGnXDCGJkrAvh_LoxyJlovUms0KGCO9foBtAmE8ehqw9y6sGuFj01v8nFKPA4eJfKSoF9MSeq3tqBBib90nh4L_FzEcvbUgu_8SIVv6oNquEz_CF6xcLOKWum1qeLGMkjxWLVLdfQ-5_181UhVXZy7V7xZ4qUhojHnSGpumwOHyNprW9djLPp1j80oArQEBNVDmFSyxl3CMyGydDzLbuWTuEdgpFYFAuMYmeU2ngM29JaTIHMTeJ9YvlfRelg9wtKuAVoewA30SHeu2gbirKZwDNcrYmbtcsLTJz2zsY4Y1hl7W9larnA_PJt860h4kpq9wsccRcxIZcMzjFmDqpwqOYST_1ZMkk6UbfkP

Getting page 26
['http://www.indeed.co.in/rc/clk?jk=eb19692d183dd182&fccid=59b5ebe2be23b284&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=9432e386f38dcc7c&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=42d486e850cdec94&fccid=e0efd5c38293090b&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=4a94dd89a9e4be65&fccid=7ab93833a090100a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb688aa3261ea8b5&fccid=dd616958bd9ddc12&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=0d18908418825b55&fccid=5d3ed74ca4598964&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=ac9bb1ca7d99f182&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=6e033c0acc4b75cd&fccid=be240c643a8631c5&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=23a20a86d43a8808&fccid=be240c643a8631c5&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=4cb1dbd5b17ad01b&fccid=b1e0f66da5c9df23&vjs=3']
Getting page 27
['http://www.indeed.co.in/rc/clk?jk=6816c623a400fdad&fccid=734cb5a01ee60f80&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=d952eec5fb55ca34&f

Getting page 32
['http://www.indeed.co.in/rc/clk?jk=6816c623a400fdad&fccid=734cb5a01ee60f80&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=42883eaa3c67e5a2&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=a8d41e03932f4f79&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=d952eec5fb55ca34&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=efce8ed0257ddff9&fccid=1c5cfc525b841c32&vjs=3', 'http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0AsDgJcjf3QstaB-NhPfcDvitvdOui-gURn6hD2sXcTqCNf98SJ_tS8y2uKQcqVtkHP2y49q9jUGnXDCGJkrAvh_LoxyJlovUms0KGCO9foBtAmE8ehqw9y6sGuFj01v8nFKPA4eJfKSoF9MSeq3tqBBib90nh4L_FzEcvbUgu_8SIVv6oNquEz_CF6xcLOKWum1qeLGMkjxWLVLdfQ-5_181UhVXZy7V7xZ4qUhojHnSGpumwOHyNp0ym0Dv5sXfFjmfDq-dODkDtufNFZT06AsacecfjAKrxmXakwrcfv3nxAVLQsAFZ1w9zUuhtSA6Z06zjM2K2ONQAQ89puI5RgLxSe103KX0Nyp1Tm-sfKZCms-wHysJuesrcbXSoAYckeF2jaz1A79rTP54S5zAMxHrmvdCRqVZwNBVC4AnIq8_9zIquMzdnmxz4doxqhVeLON3_M0dE7hDFKZ9JgZjBpCHGYk6v6auPSsaMrvV4vNce4bWxJ3uWPEwuGmnPjfGsB

Getting page 34
['http://www.indeed.co.in/rc/clk?jk=efce8ed0257ddff9&fccid=1c5cfc525b841c32&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=bf842275ed3d7542&fccid=024f492ec424f077&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=11b8fcee57851b1c&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b8a1b25e7807993d&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=6bb400e1c351bb4a&fccid=6736cde444d98758&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=2f8c1bebbf58924c&fccid=d40ebe11fc879426&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=e21faff2f046fefb&fccid=be240c643a8631c5&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=dd2f5f8fea23d20a&fccid=b82cd68dfe3e06b4&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=23a20a86d43a8808&fccid=be240c643a8631c5&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=a0e44abf54c5b9c9&fccid=b46520827b78454f&vjs=3', 'http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0AsDgJcjf3QstaB-NhPfcDvitvdOui-gURn6hD2sXcTqCNf98SJ_tS8y2uKQcqVtkG4THHLts8G-Sv3JS-5glwH6IyuIJ8s5HUzZ

Getting page 37
['http://www.indeed.co.in/pagead/clk?mo=r&ad=-6NYlbfkN0DGWbMxlqSnT1fiijd83xEf-YAaxqSlo08xiEbCyqV61bonpf0d4kx_xw6JXJ0bRVNGo72Wl38IOuBIrIp6m98aBLZDYfB2YeCxq1-61E0f4WQxw2cyHN0qEBkiJQAxkhfWi10qnOpbrqsDiRpqqBe3aITs3vyhDodGQu3Hj-J7oUnPUx8ZopfqV0mc_8XmPt-WMtNMeT-20gKdfLwzUnmkvV7Ts_fUNuRktU3mXLEvK81rxt9DvpKE6NsbYG9YkmigoHjNjGa6c4SVEgzRKlKvRPT8Ixzvm2Cc5W77iO3t_A3Kb_3FmL1ESKQHdAkwoS6ceitbe6oG3i-0ovARARf49g_A3nF1UauSipC3kkn0xDp-WX9sjsATXAQPxU2kQcV7Ag7Ix0FVpCQQTAgDLMRBp-xEMrO9OQ_NkgVYiOFa1EhbFCl77ad4MHLInwk_kGmIGNQ4Us11luKJnK7bidKKRfyltQdBB-SZtg6D0YZNZGhv6PWOiu_G0FmA8sIKeOxrlALp7HmG5JmlqP_mZQtCVqKu6r-_MgCEa5UXuoh-GetaYslBtO-Id22UbNcqnn5KPy93hoete9Yaw9H0r4EgUaZxXwKQlPFdkRKD4tcVqsTsrQ8yeO5NxzNTSP_4OMSI_OSjkvfYCC6cW2SmSNST9jwzDK1ycEBOHinNv5Rtkrvhgllh0unlHuWiZoR1sqxfact-nS_H78r2xpMk80pQB1hXoGi174TXrBQN5n71wc-tLEDVlZGMZ8ZTPgnRvE0zII5RypdQeD5nd0hpmfa9BT8oPy7dCq8pcKlSfx2gG_m3at5kKvJuLLf24CkTNVQxzkuHXDafGR4FTZIXVAZpFeWLIGrKULAwvbQujaPluNwHLD2InCL01PLO9Og9VDjEQXH29xdBSgvFsUdWUfrHRkKDnBcTmAC

Getting page 40
['http://www.indeed.co.in/rc/clk?jk=99a0f030d373ec54&fccid=cd1c087ed0fcb7d8&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b9c0ebda601461c5&fccid=21df030fae150acc&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=f4abbbfc2c0b08e6&fccid=d3d3520998346837&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb19692d183dd182&fccid=59b5ebe2be23b284&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=3e060ff4255bc3d5&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=1398147ffaaf489e&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=42d486e850cdec94&fccid=e0efd5c38293090b&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=9432e386f38dcc7c&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb688aa3261ea8b5&fccid=dd616958bd9ddc12&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=4a94dd89a9e4be65&fccid=7ab93833a090100a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=ac9bb1ca7d99f182&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=0d18908418825b55&fccid=5d3ed74ca459

Getting page 46
['http://www.indeed.co.in/rc/clk?jk=99a0f030d373ec54&fccid=cd1c087ed0fcb7d8&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=b9c0ebda601461c5&fccid=21df030fae150acc&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=f4abbbfc2c0b08e6&fccid=d3d3520998346837&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb19692d183dd182&fccid=59b5ebe2be23b284&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=3e060ff4255bc3d5&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=1398147ffaaf489e&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=42d486e850cdec94&fccid=e0efd5c38293090b&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=9432e386f38dcc7c&fccid=fe2d21eef233e94a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=eb688aa3261ea8b5&fccid=dd616958bd9ddc12&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=4a94dd89a9e4be65&fccid=7ab93833a090100a&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=ac9bb1ca7d99f182&fccid=b1e0f66da5c9df23&vjs=3', 'http://www.indeed.co.in/rc/clk?jk=0d18908418825b55&fccid=5d3ed74ca459

Getting page 50


URLError: <urlopen error [Errno 11001] getaddrinfo failed>

In [11]:
job_info.head()

NameError: name 'job_info' is not defined