<a href="https://colab.research.google.com/github/gautamdhanrajnemmaniwar/Project-on-Navigating-the-Data-Science-Job-Landscape/blob/main/Mid_Course_Summative_Assignment_Numerical_Programming_in_Python_Analyze_it_Yourself.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Problem Statement: Navigating the Data Science Job Landscape**

🚀 Unleash your creativity in crafting a solution that taps into the heartbeat of the data science job market! Envision an ingenious project that seamlessly wields cutting-edge web scraping techniques and illuminating data analysis.

🔍 Your mission? To engineer a tool that effortlessly gathers job listings from a multitude of online sources, extracting pivotal nuggets such as job descriptions, qualifications, locations, and salaries.

🧩 However, the true puzzle lies in deciphering this trove of data. Can your solution discern patterns that spotlight the most coveted skills? Are there threads connecting job types to compensation packages? How might it predict shifts in industry demand?

🎯 The core objectives of this challenge are as follows:

1. Web Scraping Mastery: Forge an adaptable and potent web scraping mechanism. Your creation should adeptly harvest data science job postings from a diverse array of online platforms. Be ready to navigate evolving website structures and process hefty data loads.

2. Data Symphony: Skillfully distill vital insights from the harvested job listings. Extract and cleanse critical information like job titles, company names, descriptions, qualifications, salaries, locations, and deadlines. Think data refinement and organization.

3. Market Wizardry: Conjure up analytical tools that conjure meaningful revelations from the gathered data. Dive into the abyss of job demand trends, geographic distribution, salary variations tied to experience and location, favored qualifications, and emerging skill demands.

4. Visual Magic: Weave a tapestry of visualization magic. Design captivating charts, graphs, and visual representations that paint a crystal-clear picture of the analyzed data. Make these visuals the compass that guides users through job market intricacies.

🌐 While the web scraping universe is yours to explore, consider these platforms as potential stomping grounds:

* LinkedIn Jobs
* Indeed
* Naukri
* Glassdoor
* AngelList

🎈 Your solution should not only decode the data science job realm but also empower professionals, job seekers, and recruiters to harness the dynamic shifts of the industry. The path is open, the challenge beckons – are you ready to embark on this exciting journey?






# **Project Name** - Navigating the Data Science Job Landscape

* **Project Type** - Web Scraping and Data Visualization on Job Postings
* **Project Member** - Gautam Dhanraj Anita Nemmaniwar

# **Project Summary**

* The purpose of these project is to **analyse the market for data science jobs** that are available all over India and harvest vital insights that can help professionals, job seekers, and recruiters to utilize this data inorder to understand the dynamic shifts of the industry.

* Here we are going to first **scrap data** from data science job postings that are available online on many sites like LinkedIn, Indeed, Naukri, Shine, etc. Then we are going to do few **data wrangling** operations to organize our data that can be readily used for **data visualization**.

* At last we are going to end the project with a brief and suffice conclusion that gives an overall understanding of the project.

# **GitHub Link -**

# **Problem Statement**


The primary objectives of this project is:

* To perform and understand **web scraping** mechanism inorder to gather data on data science job postings from online platforms.

* To perform various **data wrangling** operations like removing or replacing data, adding a new column, changing the datatype, etc.

* To analyse data and collect important insights with the help of charts, graphs, and visual representations (**data visualization**).

# ***Let's Begin !***

## ***1. Web Scraping***

In [2]:
# Importing Libraries

import warnings
warnings.filterwarnings("ignore")
from bs4 import BeautifulSoup
import requests
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.size'] = 12


In [3]:
# Making a request to the website for multiple pages

page = 1   # Starting from page no. 1 of the job website
responses_list = []   # Empty list to store the response for all pages

while page != 50:

  # Website link from where the job postings are to be extracted
  url = f"https://www.shine.com/job-search/data-science-jobs-{page}?top_companies_boost=true&q=Data%20Science,&location=243&location=437&location=249&location=246&location=244&location=423&location=247&location=453&location=406&location=285&location=329&location=424&location=353&location=310&location=404&location=522&location=442&location=378&location=321&location=290&location=523&location=278&location=358&location=315&location=291&location=289&location=330&location=305&location=386&location=328&location=322&location=242&location=400&location=364&location=272"

  # Request sent to the site inorder to get the URL for the specific page
  response = requests.get(url)

  # Print the response code after getting the response of the request we sent
  print(f"The response that we got back from the URL is {response.status_code}.")

  # Adding the response of each page to the list
  responses_list.append(response)

  # Incrementing the page by 1 to go to next page
  page += 1


The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back from the URL is 200.
The response that we got back f

In [4]:
# Parsing through HTML for each page

soup_list = []   # Empty list to store all parsed data
html_list = []   # Empty list to store all HTML data

for i in responses_list:
  soup = BeautifulSoup(i.text,'html.parser')   # Parsing through the HTML data for each page
  soup_list.append(soup)

  html = soup.find_all('div')   # Extracting all the HTML data for 'div' from each page
  html_list.append(html)


***Job Title***

In [77]:
# Finding Job Titles from each page

titles_list = []   # Empty list for storing all the job titles

for i in soup_list:

  # Obtaining the HTML data for job title
  req = i.select('div h2[itemprop = "name"]')

  # Getting the text from the HTML for each page
  titles = [r.text for r in req]

  # Removing spaces if any
  titles = [t.replace("  ", " ").strip() for t in titles]

  # Adding job titles of each page to a list for each job positing
  titles_list.append(titles)

# Flatten List using sum() function
titles_list = sum(titles_list, [])

# Print all the job titles
titles_list


['Data Scientist',
 'Data Scientist',
 'Data Science',
 'for __Data Scientist Mumbai',
 'Required for _Data Scientist Mumbai',
 'Data Scientist Recruitment',
 'Data Scientist',
 'Data Scientist',
 'Senior Data Scientist',
 'Data Scientist Lead',
 'Gen AI- Developer',
 'Gen AI- Developer',
 'Data Scientist Artificial Intelligence',
 'Data Scientist Vacancy',
 'Data Scientist Vacancy',
 'Data Scientist',
 'Profile for _Data Science Consultant',
 'Senior Consultant/Principal - Business Consulting (AILA ...',
 'Hiring For Data Scientist',
 'Hiring For Data Scientist',
 'Data Scientist (ML/AI) Pune',
 'Senior Data Scientist',
 'Data scientist Drivetrain',
 'Machine Learning Engineer',
 'Research Manager',
 'Engineer II - Data scientist',
 'Asset Management Marketing Intelligence',
 'Technical Product Owner- Data warehousing ,data visuali ...',
 'Business Processes Consultant- AI/ML (Data Services/Int ...',
 'Data Scientist',
 'Data Scientist Artificial Intelligence',
 'Product Data Science 

***Company Name***

In [79]:
# Finding Company Names from each page

company_list = []   # Empty list for storing all the company names

for i in soup_list:

  # Obtaining the HTML data from the class where the company name data is present
  req1 = soup.find_all('div', class_ = 'jobCard_jobCard_cName__mYnow')

  # Fetching the text from the HTML for each page
  company = [r.text for r in req1]
  sub_string = 'Hiring'

  # Splitting the string on a sub_string and getting the first index (cleaning up names)
  company = [c.split(sub_string)[0] for c in company]

  # Removing spaces if any
  company = [c.strip().replace(".", "") for c in company]

  # Adding the company name to the list for each job positing
  company_list.append(company)

# Flatten List using sum() function
company_list = sum(company_list, [])

# Print all the company names
company_list


['MNR Solutions Pvt Ltd',
 'EXPONUS TRADELINK PRIVATE LIMITED',
 'CORPORATE STEPS',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Manish Enterprises',
 'Manish Enterprises',
 'Manish Enterprises',
 'Manish Enterprises',
 'Manish Enterprises',
 'Merck Ltd',
 'Clyent Technologies',
 'Merck Ltd',
 'MNR Solutions Pvt Ltd',
 'QUISCON BIOTECH',
 'MNR Solutions Pvt Ltd',
 'EXPONUS TRADELINK PRIVATE LIMITED',
 'CORPORATE STEPS',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Alpine Manpower Services',
 'Manish Enterprises',
 'Manish Enterprises',
 'Manish Enterprises',
 'Manish Enterprises',
 'Manish Enterprises',
 'Merck Ltd',
 'Clyent Technologies',
 'Merck Ltd',
 'MNR Solutions Pvt Ltd',
 'QUISCON BIOTECH',


***Job Location***

In [80]:
# Finding Job Locations from each page

location_list = []   # Empty list for storing all the job locations

for i in soup_list:

  # Obtaining the HTML data from the class where the job location data is present
  req2 = i.find_all('div', class_ = 'jobCard_jobCard_lists__fdnsc')

  # Obtaining all the text from the HTML
  location = [r.text for r in req2]

  # Cleaning up using regex
  location = [re.findall("Yrs?(.*)$", l)[0] for l in location]

  # Getting rid of unnecessary text
  location = [l.replace("Mumbai City", "Mumbai") for l in location]
  location = ["All India" if ',' in l or '+' in l else l for l in location]

  # Adding the job location to the list for each job positing
  location_list.append(location)

# Flatten List using sum() function
location_list = sum(location_list, [])

# Print all the job locations
location_list


['All India',
 'All India',
 'Bangalore',
 'Mumbai',
 'Mumbai',
 'All India',
 'Bangalore',
 'All India',
 'All India',
 'All India',
 'Pune',
 'Bangalore',
 'All India',
 'All India',
 'All India',
 'All India',
 'All India',
 'Bangalore',
 'All India',
 'All India',
 'Pune',
 'Bangalore',
 'Bangalore',
 'Bangalore',
 'All India',
 'Bangalore',
 'Mumbai',
 'Bangalore',
 'Pune',
 'Pune',
 'Gurugram',
 'Bangalore',
 'Bangalore',
 'Bangalore',
 'Chennai',
 'Bangalore',
 'Gurugram',
 'Mumbai',
 'Chennai',
 'Pune',
 'Chennai',
 'Bangalore',
 'Bangalore',
 'Bangalore',
 'Bangalore',
 'Bangalore',
 'All India',
 'Pune',
 'Chennai',
 'All India',
 'Pune',
 'Bangalore',
 'Gurugram',
 'Noida',
 'Chennai',
 'Mumbai',
 'Mumbai',
 'Bangalore',
 'Mumbai',
 'Mumbai',
 'Bangalore',
 'All India',
 'All India',
 'Mumbai',
 'Hyderabad',
 'Mumbai',
 'Kochi',
 'Kochi',
 'Bangalore',
 'Udaipur',
 'Bangalore',
 'Gurugram',
 'All India',
 'All India',
 'Mumbai',
 'All India',
 'Chennai',
 'All India',
 'Bang

***Job Experience***

In [81]:
# Finding Job Experience requried from each page

experience_list = []   # Empty list for storing all the job experience data

for i in soup_list:

  # Obtaining the HTML data from the class where the job experience data is present
  req3 = i.find_all('div', class_ = 'jobCard_jobCard_lists__fdnsc')

  # Obtaining all the text from the HTML
  experience = [r.text for r in req3]

  # Cleaning up using regex
  experience = [re.findall("^(.*) Yrs?", e)[0] for e in experience]

  # Adding the job experience to the list for each job positing
  experience_list.append(experience)

# Flatten List using sum() function
experience_list = sum(experience_list, [])

# Print all the job experience requried
experience_list


['3 to 6',
 '3 to 6',
 '0 to 2',
 '2 to 7',
 '2 to 7',
 '0 to 4',
 '0 to 4',
 '3 to 5',
 '5 to 10',
 '3 to 5',
 '0 to 2',
 '0 to 2',
 '7 to 12',
 '0 to 4',
 '0 to 4',
 '14 to 24',
 '3 to 6',
 '4 to 9',
 '0 to 4',
 '0 to 4',
 '0 to 1',
 '22 to 24',
 '0 to 4',
 '6 to 8',
 '12 to 14',
 '3 to 6',
 '0 to 4',
 '3 to 5',
 '5 to 7',
 '4 to 7',
 '6 to 8',
 '8 to 10',
 '7 to 10',
 '8 to 10',
 '0 to 4',
 '5 to 7',
 '5 to 10',
 '2 to 7',
 '8 to 13',
 '0 to 4',
 '12 to 15',
 '0 to 2',
 '0 to 5',
 '7 to 9',
 '5 to 7',
 '0 to 4',
 '0 to 1',
 '0 to 4',
 '5 to 6',
 '8 to 13',
 '5 to 7',
 '5 to 7',
 '6 to 8',
 '10 to 12',
 '7 to 12',
 '2 to 7',
 '4 to 6',
 '3 to 4',
 '4 to 6',
 '4 to 6',
 '1 to 7',
 '5 to 9',
 '1 to 5',
 '2 to 7',
 '4 to 9',
 '3 to 8',
 '12 to 18',
 '6 to 10',
 '1 to 2',
 '3 to 5',
 '5 to 10',
 '7 to 10',
 '0 to 4',
 '0 to 4',
 '3 to 8',
 '0 to 2',
 '3 to 5',
 '0 to 4',
 '5 to 10',
 '3 to 6',
 '4 to 6',
 '5 to 8',
 '2 to 4',
 '3 to 6',
 '0 to 1',
 '6 to 8',
 '6 to 8',
 '6 to 8',
 '5 to 

In [82]:
# Creating a new list for job experience that contains only one value

new_experience_list = []   # Creating a new list to store the 0th index of the string as experience

for e in experience_list:
  new_experience_list.append(e[0])   # Adding the initial experience value (0th Index) to the list

# Print all the 0th Index of job experience requried
new_experience_list


['3',
 '3',
 '0',
 '2',
 '2',
 '0',
 '0',
 '3',
 '5',
 '3',
 '0',
 '0',
 '7',
 '0',
 '0',
 '1',
 '3',
 '4',
 '0',
 '0',
 '0',
 '2',
 '0',
 '6',
 '1',
 '3',
 '0',
 '3',
 '5',
 '4',
 '6',
 '8',
 '7',
 '8',
 '0',
 '5',
 '5',
 '2',
 '8',
 '0',
 '1',
 '0',
 '0',
 '7',
 '5',
 '0',
 '0',
 '0',
 '5',
 '8',
 '5',
 '5',
 '6',
 '1',
 '7',
 '2',
 '4',
 '3',
 '4',
 '4',
 '1',
 '5',
 '1',
 '2',
 '4',
 '3',
 '1',
 '6',
 '1',
 '3',
 '5',
 '7',
 '0',
 '0',
 '3',
 '0',
 '3',
 '0',
 '5',
 '3',
 '4',
 '5',
 '2',
 '3',
 '0',
 '6',
 '6',
 '6',
 '5',
 '2',
 '2',
 '2',
 '2',
 '2',
 '5',
 '0',
 '0',
 '0',
 '2',
 '5',
 '0',
 '8',
 '0',
 '0',
 '5',
 '5',
 '3',
 '6',
 '6',
 '6',
 '6',
 '6',
 '3',
 '3',
 '2',
 '3',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '2',
 '0',
 '0',
 '3',
 '0',
 '3',
 '3',
 '3',
 '0',
 '0',
 '5',
 '3',
 '3',
 '3',
 '4',
 '2',
 '3',
 '3',
 '5',
 '0',
 '0',
 '0',
 '7',
 '8',
 '3',
 '1',
 '2',
 '0',
 '5',
 '3',
 '0',
 '4',
 '3',
 '7',
 '8',
 '0',
 '5',
 '5'

***Job Vacancies***

In [83]:
# Finding Number of vacancies available from each page

vacancies_list = []   # Empty list for storing all the vacancies related data

for i in soup_list:

  # Obtaining the HTML data from the class where the no. of vacancies data is present
  req4 = soup.find_all('ul', class_='jobCard_jobCard_jobDetail__jD82J')

  # Getting the text from the HTML
  vacancies = [r.text for r in req4 ]

  # Cleaning up the data using regex
  vacancies = [int(re.findall(r'\d+', v)[0]) if re.findall(r'\d+', v) else 1 for v in vacancies]

  # Adding the no. of vacancies to the list for each job positing
  vacancies_list.append(vacancies)

# Flatten List using sum() function
vacancies_list = sum(vacancies_list, [])

# Print all the job vacancies available
vacancies_list


[1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,
 99,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 10,
 1,
 8,
 12,
 1,


In [43]:
# Let's put together all the data extracted from HTML into main DATAFRAME

job_data = {'Titles': titles_list, 'Company Name': company_list, 'Job Location': location_list, 'Experience (in range)': experience_list, 'Total Vacancies': vacancies_list, 'Experience': new_experience_list}

df = pd.DataFrame(job_data)   # Pandas Dataframe


In [44]:
df

Unnamed: 0,Titles,Company Name,Job Location,Experience (in range),Total Vacancies,Experience
0,Data Scientist,MNR Solutions Pvt Ltd,All India,3 to 6,1,3
1,Data Scientist,EXPONUS TRADELINK PRIVATE LIMITED,All India,3 to 6,99,3
2,Data Science,CORPORATE STEPS..,Bangalore,0 to 2,1,0
3,for__Data ScientistMumbai,Alpine Manpower Services,Mumbai,2 to 7,1,2
4,Required for _Data ScientistMumbai,Alpine Manpower Services,Mumbai,2 to 7,1,2
...,...,...,...,...,...,...
975,Sr Analyst,Merck Ltd,Bangalore,4 to 9,1,4
976,DBT Lead With Python and SnowFlake Bangalore,Clyent Technologies,Bangalore,5 to 10,10,5
977,Associate Clinical Supply Services,Merck Ltd,Bangalore,1 to 3,1,1
978,Hiring for data analyst- Chennai,MNR Solutions Pvt Ltd,Chennai,0 to 2,8,0


## ***2. Data Wrangling***

In [46]:
# Dataset First Look
# First 5 rows of the dataset

df.head()


Unnamed: 0,Titles,Company Name,Job Location,Experience (in range),Total Vacancies,Experience
0,Data Scientist,MNR Solutions Pvt Ltd,All India,3 to 6,1,3
1,Data Scientist,EXPONUS TRADELINK PRIVATE LIMITED,All India,3 to 6,99,3
2,Data Science,CORPORATE STEPS..,Bangalore,0 to 2,1,0
3,for__Data ScientistMumbai,Alpine Manpower Services,Mumbai,2 to 7,1,2
4,Required for _Data ScientistMumbai,Alpine Manpower Services,Mumbai,2 to 7,1,2


In [47]:
# Dataset Rows & Columns count

print(df.shape)
print(f"Total number of Rows: {len(df)}")
print(f"Total number of Columns: {len(df.columns)}")


(980, 6)
Total number of Rows: 980
Total number of Columns: 6


In [48]:
# Dataset Info
# First hand info about dataset

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 980 entries, 0 to 979
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Titles                 980 non-null    object
 1   Company Name           980 non-null    object
 2   Job Location           980 non-null    object
 3   Experience (in range)  980 non-null    object
 4   Total Vacancies        980 non-null    int64 
 5   Experience             980 non-null    object
dtypes: int64(1), object(5)
memory usage: 46.1+ KB


In [49]:
# Dataset Columns
# List of columns in the data frame

df.columns


Index(['Titles', 'Company Name', 'Job Location', 'Experience (in range)',
       'Total Vacancies', 'Experience'],
      dtype='object')

In [51]:
# Check unique values for each variable (columns)

print("Number of uniques values for each column is:\n")
for i in df.columns.tolist():
  print(f"{i} ({df[i].nunique()}) : {df[i].unique()}\n")


Number of uniques values for each column is:

Titles (706) : ['Data Scientist ' 'Data Science' ' for__Data ScientistMumbai'
 'Required for _Data ScientistMumbai' 'Data Scientist Recruitment'
 'Data Scientist' 'Senior Data Scientist' 'Data Scientist Lead'
 'Gen AI- Developer' 'Data Scientist Artificial Intelligence'
 'Data Scientist Vacancy' 'Profile for_Data Science Consultant'
 'Senior Consultant/Principal - Business Consulting (AILA ...'
 'Hiring For Data Scientist' 'Data Scientist (ML/AI) Pune'
 'Data scientistDrivetrain' 'Machine Learning Engineer' 'Research Manager'
 'Engineer II - Data scientist' 'Asset Management Marketing Intelligence'
 'Technical Product Owner- Data warehousing ,data visuali ...'
 'Business Processes Consultant- AI/ML (Data Services/Int ...'
 'Product Data Science Manager, People Analytics'
 'Diagnostic Data Engineer for a Leading Automotive Compa ...'
 'Associate Sr. - Data Analytics'
 'Associate, Associate Data Manager, Clinical Data Scienc ...'
 'Senior Dat

In [52]:
# Total number of rows before removing duplicates
print(f"Total number of rows before removing any duplicate rows are: {df.shape}\n")

# Number of duplicate rows
print(f"Number of duplicate rows in the dataset are: {df[df.duplicated()].shape}\n")

# Removing duplicate rows
df.drop_duplicates(inplace = True)

# Total number of rows after removing duplicates
print(f"Total number of rows after removing duplicate rows are: {df.shape}")


Total number of rows before removing any duplicate rows are: (980, 6)

Number of duplicate rows in the dataset are: (51, 6)

Total number of rows after removing duplicate rows are: (929, 6)


In [53]:
# Checking if their are any missing values

df.isnull().sum()


Titles                   0
Company Name             0
Job Location             0
Experience (in range)    0
Total Vacancies          0
Experience               0
dtype: int64

In [54]:
# Changing datatype of the column 'Experience'

df['Experience'] = df['Experience'].astype('int64')


In [60]:
df['Total Vacancies'].unique()

array([ 1, 99, 10,  8, 12])

In [61]:
# Adding new column

# Creating a column known as 'Vacancy_Category' by performing a nested condition on column 'Total Vacancies'
df['Vacancy_Category'] = np.where(df['Total Vacancies'] == 1, 'Single Vacancy', np.where((df['Total Vacancies'] > 1) & (df['Total Vacancies'] < 15), 'Medium Vacancies', 'High Vacancies'))

# Creating a column known as 'Experience_Category' by performing a nested condition on column 'Experience'
df['Experience_Category'] = np.where(df['Experience'] == 0, 'Fresher', np.where((df['Experience'] > 0) & (df['Experience'] < 6), 'Medium Experience', 'High Experience'))



In [62]:
# Let's see the dataset after adding new columns

df.head()


Unnamed: 0,Titles,Company Name,Job Location,Experience (in range),Total Vacancies,Experience,Experience_Category,Vacancy_Category
0,Data Scientist,MNR Solutions Pvt Ltd,All India,3 to 6,1,3,Medium Experience,Single Vacancy
1,Data Scientist,EXPONUS TRADELINK PRIVATE LIMITED,All India,3 to 6,99,3,Medium Experience,High Vacancies
2,Data Science,CORPORATE STEPS..,Bangalore,0 to 2,1,0,Fresher,Single Vacancy
3,for__Data ScientistMumbai,Alpine Manpower Services,Mumbai,2 to 7,1,2,Medium Experience,Single Vacancy
4,Required for _Data ScientistMumbai,Alpine Manpower Services,Mumbai,2 to 7,1,2,Medium Experience,Single Vacancy


## ***3. Data Vizualization***

# **Conclusion**