# LinkedIn Tech Jobs Data Visualization Project
The dataset is available on Kaggle: https://www.kaggle.com/datasets/joebeachcapital/linkedin-jobs/data

This project will use data visualization to explore the LinkedIn Tech Jobs dataset and identify key trends, such as the most in-demand skills, how demand has changed over time, and how demand varies across industries and job titles.

Interactive data visualizations will allow users to explore the data in more detail.

The goal is to provide insights into the tech job market that can be used to make informed decisions about careers.

Examples of data visualizations you can find in our project:


*   Correlation between the number of LinkedIn followers and the number of job listings
*   The distribution of the number of applicants for the job listings
*   The top 10 companies with the highest average number of applicants
*   Distribution of data per city in India
*   Pareto chart of data
*   India City Distribution
*   Ratio of applicants per employee (to gauge how easy or difficult a company is to get into)
*   Number of applicants by industry (to identify the most popular industries)
*   The 10 skills that appear the most
*   Word Cloud of Job Titles in LinkedIn Tech Jobs Dataset

Each code cell contains short comments that briefly explain what it does; more detailed explanations follow in the text section right after the corresponding cell.

In [10]:
'''Step 1 : importing libraries'''
# Import libraries for data loading, preprocessing, and plotting
import os
import requests
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Tkinter is used in a later cell to embed a chart in a desktop window
import tkinter as tk
from tkinter import ttk
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

matplotlib.use('Agg')
%matplotlib inline

# Fall back to the non-interactive Agg backend when no display is available
if os.environ.get('DISPLAY', '') == '':
    print('no display found. Using non-interactive Agg backend')
    matplotlib.use('Agg')

### CREATE VIRTUAL DISPLAY ###
!apt-get install -y xvfb # Install X Virtual Frame Buffer
import os
os.system('Xvfb :1 -screen 0 1600x1200x16  &')    # create virtual display with size 1600x1200 and 16 bit color. Color can be changed to 24 or 8
os.environ['DISPLAY']=':1.0'    # tell X clients to use our virtual DISPLAY :1.0.

"""
# If someone wants to try it with Numpy
url = 'https://raw.githubusercontent.com/Jhonnatan7br/LinkedIn-Tech/main/LinkedIn%20Tech%20jobs%20-%20Informatic%20and%20Telecoms.csv'
df = np.genfromtxt(url, delimiter=',', skip_header=1, dtype=None, encoding=None)
print(df)
"""
url = 'https://raw.githubusercontent.com/Jhonnatan7br/LinkedIn-Tech/main/LinkedIn%20Tech%20jobs%20-%20Informatic%20and%20Telecoms.csv'
df = pd.read_csv(url,  sep=",")

"""Step 2 :Explore the DataFrame to get an understanding of the data:"""
# Display the first few rows of the DataFrame (the full dataset is too large to show)
df.head()

# Summary statistics are computed in Step 3
#df.describe()

""" Step 3 : Clean the data:"""
# Check for missing values in the dataset
missing_values = df.isnull().sum()
# Remove any duplicate rows
df = df.drop_duplicates()
# Remove any rows with missing values
df = df.dropna()

# Get a summary of the numerical columns
numerical_summary = df.describe()
missing_values, numerical_summary


'apt-get' is not recognized as an internal or external command,
operable program or batch file.


(Company_Name          0
 Class                 0
 Designation           0
 Location              0
 Total_applicants      0
 LinkedIn_Followers    0
 Level                 0
 Involvement           0
 Employee_count        0
 Industry              0
 PYTHON                0
 C++                   0
 JAVA                  0
 HADOOP                0
 SCALA                 0
 FLASK                 0
 PANDAS                0
 SPARK                 0
 NUMPY                 0
 PHP                   0
 SQL                   0
 MYSQL                 0
 CSS                   0
 MONGODB               0
 NLTK                  0
 TENSORFLOW            0
 LINUX                 0
 RUBY                  0
 JAVASCRIPT            0
 DJANGO                0
 REACT                 0
 REACTJS               0
 AI                    0
 UI                    0
 TABLEAU               0
 NODEJS                0
 EXCEL                 0
 POWER BI              0
 SELENIUM              0
 HTML                  0


# Additional details and explanations :
 **Step 1 : Importing libraries**

> The import statements introduce external libraries that provide specific functionalities:


*   requests: Makes HTTP requests for interacting with web APIs and retrieving data.
*   numpy: Performs efficient numerical operations and data manipulation.
*   pandas: Analyzes and manipulates tabular data, time series, and statistics.
*   matplotlib.pyplot: Creates 2D visualizations like line plots, scatter plots, and bar charts.
*   seaborn: Builds on matplotlib to create visually appealing and informative data visualizations.

**Step 2 : Explore the DataFrame**

> The provided code snippet represents the second step in data analysis, which involves exploring the DataFrame to gain an understanding of the data's structure, content, and characteristics.

* df.head(): This line displays the first five rows of the DataFrame, providing a glimpse into the data's format and the types of information it contains.

* df.describe(): Commented out in this step and executed in Step 3 instead, the df.describe() method provides summary statistics for each numerical column in the DataFrame. This includes the count, mean, standard deviation, minimum, quartiles (the 50% quartile is the median), and maximum, helping to understand the distribution and central tendency of the data.

>By exploring the DataFrame using these methods, data analysts can identify patterns, anomalies, and potential relationships between variables, laying the foundation for further analysis and exploration.

**Step 3 : Clean the data**
> The provided code snippet represents the third step in data analysis, which involves cleaning and preparing the data for further analysis.

* Check for missing values: The df.isnull().sum() method calculates the number of missing values (NA or NaN) in each column of the DataFrame. The resulting Series, missing_values, holds the column names as its index and the corresponding number of missing values as values. This information can help identify columns with a significant amount of missing data that may require further attention.

* Remove duplicate rows: The df.drop_duplicates() method eliminates duplicate rows in the DataFrame. Duplicate rows are identified based on all columns in the DataFrame, ensuring that each unique combination of values is represented only once. This step can help remove redundant data and ensure that subsequent analysis is based on a distinct set of records.

* Remove rows with missing values: The df.dropna() method removes rows with at least one missing value. This step eliminates incomplete records that may introduce errors or biases in the analysis. However, it is important to consider the impact of removing a significant portion of the data and whether imputation techniques could be used to address missing values instead.

>Get a summary of the numerical columns: The df.describe() method provides summary statistics for each numerical column in the DataFrame. This includes measures like mean, median, standard deviation, and quartiles, helping to understand the distribution and central tendency of the data. This information can be used to identify potential outliers, detect skewed distributions, and assess the overall variability of the numerical variables.
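
The cleaning steps above can be tried on a small, made-up DataFrame before running them on the full dataset. The sketch below is illustrative only: the toy rows and the choice to fill the missing applicant count with the column median are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one duplicate row and one missing value
toy = pd.DataFrame({
    "Company_Name": ["A", "A", "B", "C"],
    "Total_applicants": [10.0, 10.0, np.nan, 30.0],
})

# Count missing values per column
missing_values = toy.isnull().sum()

# Option 1 (as in the notebook): drop duplicates, then drop incomplete rows
dropped = toy.drop_duplicates().dropna()

# Option 2 (alternative): keep incomplete rows and impute with the median
deduplicated = toy.drop_duplicates()
imputed = deduplicated.fillna(
    {"Total_applicants": deduplicated["Total_applicants"].median()}
)

print(missing_values)
print(len(toy), len(dropped), len(imputed))
```

Comparing the row counts of the two options shows how much data dropna() discards, which helps decide whether imputation is worth considering.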


In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tkinter as tk
from tkinter import ttk
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

# Modern Color Scheme
bg_color = "#f5f5f5"  # Light grey background
fg_color = "#333333"  # Dark text
accent_color = "#007bff"  # Blue accent

# Sample DataFrame

df = pd.DataFrame({
    'Location': ['City1', 'City2', 'City3', 'City4', 'City5', 'City1', 'City2', 'City3']
})

def plot_distribution():
    # Get the top 5 cities in India with the highest frequency from the 'Location' column
    Top_5_cities_india = df['Location'].value_counts().sort_values(ascending=False).head(5)

    # Create a figure with a specific size
    fig, ax = plt.subplots(figsize=(10, 6))

    # Create a bar plot with the top 5 cities on the x-axis and their frequencies on the y-axis
    sns.barplot(x=Top_5_cities_india.index, y=Top_5_cities_india.values, ax=ax)

    # Add labels to the bars
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.0f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', fontsize=12, color='black', xytext=(0, 5),
                    textcoords='offset points')

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45)

    # Set labels for the x and y axes
    plt.xlabel('Cities')
    plt.ylabel('Frequency')

    # Set the title of the plot
    plt.title('Distribution of Data per City in India - Top 5')

    # Customize the plot style (optional)
    sns.set(style="whitegrid")

    # Display the plot in a canvas widget
    canvas = FigureCanvasTkAgg(fig, master=window)
    canvas_widget = canvas.get_tk_widget()
    canvas_widget.pack()


# Create a Tkinter window
window = tk.Tk()
window.title("Data Distribution in Indian Cities")
window.configure(bg=bg_color)

# Styling Buttons
style = ttk.Style()
style.configure('TButton', font=('Arial', 10), borderwidth='1')
style.map('TButton', foreground=[('active', accent_color)], background=[('active', bg_color)])

# Create a button to trigger the plot function
plot_button = ttk.Button(window, text="Plot Data", command=plot_distribution)
plot_button.pack()

# Adding Buttons
refresh_button = ttk.Button(window, text="Refresh Data", command=plot_distribution)
refresh_button.pack(pady=10)

export_button = ttk.Button(window, text="Export Chart")
export_button.pack(pady=10)

# Calling the function to display the plot initially
plot_distribution()

# Start the Tkinter main loop
window.mainloop()




In [3]:
'''Correlation between the number of LinkedIn followers and the number of job listings'''

'''Step 1 : '''
# Get the number of job listings of each company
num_job_listings = df['Company_Name'].value_counts()

'''step 2 :'''
# Get the number of LinkedIn followers of each company
num_followers = df.drop_duplicates(subset='Company_Name')[['Company_Name', 'LinkedIn_Followers']].set_index('Company_Name')

'''step 3 :'''
# Merge the two DataFrames
followers_vs_jobs = pd.merge(num_followers, num_job_listings, left_index=True, right_index=True)
followers_vs_jobs.columns = ['LinkedIn_Followers', 'Num_Job_Listings']

'''step 4 :'''
# Plot the correlation between the number of LinkedIn followers and the number of job listings of a company
plt.figure(figsize=(10, 6))
sns.scatterplot(x='LinkedIn_Followers', y='Num_Job_Listings', data=followers_vs_jobs)
plt.title('Correlation Between LinkedIn Followers and Number of Job Listings')
plt.xlabel('Number of LinkedIn Followers')
plt.ylabel('Number of Job Listings')
plt.xscale('log')
plt.yscale('log')
plt.show()

'''step 5 :'''
# Calculate the correlation coefficient
correlation = followers_vs_jobs.corr().iloc[0, 1]
correlation



-0.31828242750821

# Additional details and explanations :

**Step 1 : Get the number of job listings of each company**


>This line of code aims to determine the frequency of each company's appearance within the Company_Name column of the DataFrame df. The value_counts() function effectively counts the number of occurrences for each unique value in the specified column. The resulting Series, stored in num_job_listings, holds the company names as its index and their corresponding job listing counts as values.

**Step 2 : Get the number of LinkedIn followers of each company**

> This step focuses on extracting unique companies and their corresponding LinkedIn follower counts from the DataFrame df. The drop_duplicates() function eliminates duplicate entries based on the Company_Name column, ensuring that only distinct companies remain. Next, the relevant columns, Company_Name and LinkedIn_Followers, are selected using double square brackets. Finally, the set_index() method sets the Company_Name column as the index of the resulting DataFrame, num_followers.

**Step 3 : Merge the two DataFrames**

> This step involves merging the two DataFrames, num_followers and num_job_listings, based on their common index, Company_Name. The pd.merge() function performs the merging operation, and the left_index=True and right_index=True parameters specify that the index values from both DataFrames should be used as the basis for the merge. The resulting DataFrame, followers_vs_jobs, contains the combined information from both DataFrames.
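
As a minimal sketch of the counting and index-based combination described in Steps 1 to 3, the example below uses a made-up DataFrame. The toy values are hypothetical, and DataFrame.join() is used here as an alternative to pd.merge() with left_index=True and right_index=True; it is not the call used in the notebook.

```python
import pandas as pd

# Toy listings (hypothetical): one row per job listing
toy = pd.DataFrame({
    "Company_Name": ["A", "A", "B", "C", "C", "C"],
    "LinkedIn_Followers": [500, 500, 2000, 100, 100, 100],
})

# Listings per company; the explicit rename avoids relying on the
# default Series name, which differs between pandas versions
num_job_listings = toy["Company_Name"].value_counts().rename("Num_Job_Listings")

# One follower count per company, indexed by company name
num_followers = (
    toy.drop_duplicates(subset="Company_Name")
       .set_index("Company_Name")[["LinkedIn_Followers"]]
)

# Index-on-index join: rows are matched on the company name
followers_vs_jobs = num_followers.join(num_job_listings, how="inner")
print(followers_vs_jobs)
```

Naming the counts explicitly means the column labels do not have to be overwritten afterwards.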


**Step 4 : Plot the correlation between the number of LinkedIn followers and the number of job listings of a company**

> This step creates a scatter plot to visualize the relationship between the number of LinkedIn followers and the number of job listings for each company. The plt.figure() function sets the figure size, while sns.scatterplot() generates the scatter plot with the specified data from followers_vs_jobs.

> The plot's title, axis labels, and x and y-axis scaling are customized using plt.title(), plt.xlabel(), plt.ylabel(), plt.xscale(), and plt.yscale(), respectively. The log scale is applied to both axes to better represent the wide range of values in the data. Finally, the plot is displayed using plt.show().

> Scatter plot: A scatter plot is commonly used to visualize the relationship between two variables, identify trends and patterns, detect outliers, and compare groups of data.
>> We chose a scatter plot here because it lets us visually assess the strength and direction of the correlation. With a positive correlation, we would expect a general upward trend in the data points, indicating that companies with more LinkedIn followers tend to have more job listings. Conversely, a negative correlation would show as a downward trend, suggesting that companies with more LinkedIn followers tend to have fewer job listings.

**Step 5 : Calculate the correlation coefficient**

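> This final step calculates the correlation coefficient between the number of LinkedIn followers and the number of job listings. The corr() method applied to followers_vs_jobs returns a correlation matrix, using the Pearson coefficient by default. The coefficient for the two variables, LinkedIn_Followers and Num_Job_Listings, is extracted with iloc[0, 1] and stored in the variable correlation. The value obtained here, about -0.32, indicates a fairly weak negative linear relationship; it is computed on the raw values, not on the log-scaled values shown in the plot.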
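
Because the scatter plot uses log-scaled axes while corr() works on the raw values, the two can tell different stories. The sketch below, on hypothetical toy numbers, compares the Pearson coefficient before and after a log10 transform; the transform is an illustrative assumption and not a step of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): listings double each time followers grow tenfold
toy = pd.DataFrame({
    "LinkedIn_Followers": [10, 100, 1000, 10000],
    "Num_Job_Listings": [1, 2, 4, 8],
})

# Pearson correlation on the raw values (the default of DataFrame.corr())
raw_corr = toy.corr().iloc[0, 1]

# Pearson correlation on log10-transformed values, matching log-scaled axes
log_corr = np.log10(toy).corr().iloc[0, 1]

print(round(raw_corr, 3), round(log_corr, 3))
```

In this toy case the relationship is a straight line only on the log-log scale, so the log-transformed coefficient is the higher of the two.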






In [4]:
'''The distribution of the number of applicants for the job listings'''

# Plot the distribution of the number of applicants for the job listings
plt.figure(figsize=(10, 6))
sns.histplot(df['Total_applicants'], kde=True)
plt.title('Distribution of Number of Applicants')
plt.xlabel('Number of Applicants')
plt.ylabel('Frequency')
plt.show()



# Additional details and explanations :

>This code snippet plots a histogram of the Total_applicants column in the DataFrame df with a kernel density estimate (KDE) curve overlaid. The histplot() function creates the plot, and kde=True adds the KDE curve.

>A KDE is a smoothed counterpart of a histogram. It estimates the probability density function (PDF) of the data, a function whose area over an interval gives the probability that a value falls in that interval, and plots it as a continuous curve. Superimposed on the histogram, the KDE curve makes the overall shape of the distribution, such as skewness or multiple peaks, easier to see than the bars alone.
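
To make the idea of a KDE concrete, the following sketch computes a Gaussian kernel density estimate by hand with NumPy. The sample values are hypothetical and the rule-of-thumb bandwidth is an assumption; seaborn's own bandwidth selection may differ.

```python
import numpy as np

# Toy sample (hypothetical applicant counts)
sample = np.array([5.0, 8.0, 12.0, 15.0, 20.0, 22.0, 30.0, 45.0])
n = sample.size

# Rule-of-thumb bandwidth (an assumption; seaborn may pick a different one)
bandwidth = sample.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid that extends past the data on both sides
grid = np.linspace(sample.min() - 5 * bandwidth, sample.max() + 5 * bandwidth, 2000)

# Each observation contributes one Gaussian bump; the KDE is their average
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# The area under the curve should be close to 1, as for any density
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual observations, while a larger one gives a smoother curve that can hide real structure.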

In [5]:
'''The top 10 companies with the highest average number of applicants'''

'''Step 1 :'''
# Get the average number of applicants for each company
avg_applicants_per_company = df.groupby('Company_Name')['Total_applicants'].mean()

'''Step 2 :'''
# Get the top 10 companies with the highest average number of applicants
top_10_companies_applicants = avg_applicants_per_company.sort_values(ascending=False).head(10)

'''Step 3 :'''
# Plot the top 10 companies with the highest average number of applicants
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_companies_applicants.values, y=top_10_companies_applicants.index)
plt.title('Top 10 Companies with Highest Average Number of Applicants')
plt.xlabel('Average Number of Applicants')
plt.ylabel('Company')
plt.show()



# Additional details and explanations :

**Step 1: Get the average number of applicants for each company**

>This step calculates the average number of applicants for each company. The groupby() function is used to group the DataFrame by company name, and the mean() function is used to calculate the average number of applicants for each company. The result is a Series called avg_applicants_per_company.

**Step 2: Get the top 10 companies with the highest average number of applicants**

>This step sorts the avg_applicants_per_company Series in descending order and selects the top 10 companies. The sort_values() function is used to sort the Series, and the head() function is used to select the top 10 companies. The result is a Series called top_10_companies_applicants.
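
The grouping and ranking in Steps 1 and 2 can be sketched on a small, hypothetical DataFrame. The example uses Series.nlargest() as a shorthand for sorting in descending order and taking the head; this is an alternative spelling, not the call used in the notebook.

```python
import pandas as pd

# Toy listings (hypothetical values)
toy = pd.DataFrame({
    "Company_Name": ["A", "A", "B", "C", "C"],
    "Total_applicants": [10, 30, 50, 5, 15],
})

# Average number of applicants per company
avg_applicants = toy.groupby("Company_Name")["Total_applicants"].mean()

# nlargest(k) is shorthand for sort_values(ascending=False).head(k)
top_2 = avg_applicants.nlargest(2)
print(top_2)
```

Keep in mind that an average over very few listings can put a company at the top of the ranking, so the number of listings per company is worth checking alongside the mean.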

**Step 3: Plot the top 10 companies with the highest average number of applicants**

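>This step draws a horizontal bar plot with sns.barplot(), placing the average number of applicants on the x-axis and the company names on the y-axis. The title and axis labels are set in the same way as in the previous visualizations.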

In [6]:
'''Distribution of the data according to cities in India'''

'''Step 1'''
# Get the top 5 cities in India with the highest frequency from the 'Location' column
Top_5_cities_india = df['Location'].value_counts().sort_values(ascending=False).head(5)

# Print the top 5 cities and their frequencies
print(Top_5_cities_india)

'''Step 2'''
# Create a figure with a specific size
plt.figure(figsize=(10, 6))

# Create a bar plot with the top 5 cities on the x-axis and their frequencies on the y-axis
sns.barplot(x=Top_5_cities_india.index, y=Top_5_cities_india.values)

# Set labels for the x and y axes
plt.xlabel('Locations')
plt.ylabel('Frequency')

# Set the title of the plot
plt.title('Distribution of Data per City in India - Top 5')

# Display the bar plot
plt.show()




 Maharashtra      204
 Telangana        166
 Karnataka        148
 Tamil Nadu        59
 Uttar Pradesh     54
Name: Location, dtype: int64


# Additional details and explanations :

**Step 1 : Get the top 5 cities in India with the highest frequency from the 'Location' column**
> This code analyzes the geographical distribution of job listings in the LinkedIn Tech Jobs Dataset. It counts the number of listings per value of the Location column using the value_counts() function, sorts the resulting Series in descending order, and keeps the five most frequent locations with head(5). Finally, it displays the results using the print() function. Note that the Location values in the output (for example Maharashtra and Telangana) are Indian states rather than cities.
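
The printed output lists each location with a leading space, which may come from the raw Location strings. If that is the case (an assumption, since the raw values are not shown here), stripping whitespace before counting keeps the same location from being split into separate entries. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Toy data (hypothetical): the same location written with and without a leading space
toy = pd.DataFrame({"Location": [" Maharashtra", "Maharashtra", " Telangana"]})

# Counting the raw strings treats " Maharashtra" and "Maharashtra" as different
raw_counts = toy["Location"].value_counts()

# Stripping surrounding whitespace first merges them into one entry
clean_counts = toy["Location"].str.strip().value_counts()

print(raw_counts)
print(clean_counts)
```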

**Step 2 : Data visualization**
> Same as previous data visualization

In [7]:
'''Cumulative data graph in a pareto'''

'''Step 1'''
# Calculate the cumulative percentage data for city distribution
location_counts = df['Location'].value_counts().sort_values(ascending=False)
cumulative_data = np.cumsum(location_counts) / np.sum(location_counts) * 100

# Print the cumulative data
print(cumulative_data)

'''Step 2'''
# Create a figure with a specific size
plt.figure(figsize=(30, 6))

'''Step 3'''
# Get the city names in descending order of frequency
city_names = df['Location'].value_counts().sort_values(ascending=False).index

'''Step 4'''
# Create a bar plot with city names on the x-axis and cumulative data on the y-axis
plt.bar(city_names, cumulative_data)

# Set labels for the x and y axes
plt.xlabel('City')
plt.ylabel('Cumulative Data (%)')

# Set the title of the Pareto Chart
plt.title('Pareto Chart of Data')

# Display the Pareto Chart
plt.show()

 Maharashtra        25.216316
 Telangana          45.735476
 Karnataka          64.029666
 Tamil Nadu         71.322621
 Uttar Pradesh      77.997528
 Delhi              83.807169
 Haryana            87.639061
 West Bengal        90.976514
 Rajasthan          93.077874
 Gujarat            94.808405
 Kerala             96.044499
 Odisha             97.156984
 India              98.269468
 Madhya Pradesh     99.134734
 Andhra Pradesh     99.876391
 Punjab            100.000000
Name: Location, dtype: float64




# Additional details and explanations :
**Step 1 : Calculate the cumulative percentage data for city distribution**
> Sort City Counts: The frequency of each location is first calculated using the value_counts() function. Then, the counts are sorted in descending order to identify the locations with the highest number of job listings.

>Calculate Cumulative Sum: The np.cumsum() function is used to calculate the cumulative sum of the sorted counts. Each value represents the total number of job listings accumulated up to that location.

>Normalize Cumulative Sum: The cumulative sum is then divided by the total number of job listings and multiplied by 100 to express the data as percentages. This results in a series of cumulative percentages representing the proportion of job listings accounted for by all locations up to that point.

**Step 2 : Create a figure with a specific size**
>The cumulative series is printed at the end of Step 1, which gives a numerical view of how job listings are distributed across locations. Step 2 then creates a wide Matplotlib figure (30 by 6 inches), which leaves room for the location labels on the x-axis.

**Step 3 : Get the city names in descending order of frequency**
>Extract City Names: The location names are taken from the index of the value_counts() result using the .index property.

>Sort City Names: Because the counts are sorted in descending order with sort_values() before the index is read, the names come out in descending order of frequency, in the same order as the cumulative data computed in Step 1.

**Step 4: Data visualization**
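> The bars are drawn with plt.bar(), using the location names on the x-axis and the cumulative percentages on the y-axis; labels and title are set as in the previous visualizations. Note that each bar shows a cumulative percentage, whereas a classic Pareto chart shows the individual counts as bars and the cumulative percentage as a line.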
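
For comparison, a classic Pareto layout can be sketched with bars for the individual counts and a line for the cumulative percentage on a second y-axis. The example uses only the five largest location counts printed earlier in this notebook, so its percentages are shares of those five locations and differ from the full-dataset figures. The styling choices are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Five largest location counts printed earlier in this notebook
counts = pd.Series(
    {"Maharashtra": 204, "Telangana": 166, "Karnataka": 148,
     "Tamil Nadu": 59, "Uttar Pradesh": 54}
).sort_values(ascending=False)

# Cumulative share of these five locations only (not of the full dataset)
cumulative_pct = counts.cumsum() / counts.sum() * 100

fig, ax = plt.subplots(figsize=(10, 6))

# Bars show the individual counts
ax.bar(counts.index, counts.values, color="skyblue")
ax.set_xlabel("Location")
ax.set_ylabel("Number of job listings")

# A second y-axis carries the cumulative percentage line
ax2 = ax.twinx()
ax2.plot(counts.index, cumulative_pct.values, color="black", marker="o")
ax2.set_ylim(0, 105)
ax2.set_ylabel("Cumulative share (%)")

ax.set_title("Pareto chart of listings per location (sample)")
plt.show()
```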


In [8]:
'''India City Distribution'''

'''Step 1'''
# Create a subplot with a specific aspect ratio
fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))

'''Step 2'''
# Get the top 4 locations with the highest frequency from the 'Location' column
labels = df['Location'].value_counts().sort_values(ascending=False).head(4).index

# Get the corresponding counts for the top 4 locations
values = df['Location'].value_counts().sort_values(ascending=False).tolist()[:4]

'''Step 3'''
# Create a pie chart with specified parameters
wedges, texts, autotexts = ax.pie(values, wedgeprops=dict(width=0.65), startangle=-40, autopct='%.0f%%')

# Define properties for the annotation box
bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)

# Create a dictionary of keyword arguments (kw) for annotation
kw = dict(arrowprops=dict(arrowstyle="-"),
          bbox=bbox_props, zorder=0, va="center")

# Loop through the wedges (slices) in the pie chart
for i, p in enumerate(wedges):
    # Calculate the angle at the center of the wedge
    ang = (p.theta2 - p.theta1) / 2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))

    # Determine the horizontal alignment of the annotation
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]

    # Define the connection style for the annotation arrow
    connectionstyle = f"angle,angleA=0,angleB={ang}"

    # Update the arrowprops and annotate the label
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax.annotate(labels[i], xy=(x, y), xytext=(1.35*np.sign(x), 1.4*y),
                horizontalalignment=horizontalalignment, **kw)

# Set the title of the pie chart
ax.set_title("India City Distribution")

# Display the pie chart
plt.show()



# More details and explanations :

**Step 1: Figure and Axes Setup:**

>fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal")): This line creates a figure with a single subplot and sets its size to 6 inches in width and 3 inches in height. It also ensures that the aspect ratio of the plot is equal, preserving the circular shape of the donut chart.

**Step 2: Data Preparation:**

* labels = df['Location'].value_counts().sort_values(ascending=False).head(4).index: This line identifies the top four cities by sorting the Location column's value counts in descending order and selecting the top four labels. These labels represent the city names.

* values = df['Location'].value_counts().sort_values(ascending=False).tolist()[:4]: This line extracts the corresponding values for the top four cities based on the labels. These values represent the frequencies of each city.
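
The labels and values above are taken from two separate value_counts() calls. As a small sketch on hypothetical data, both can also be read from a single sorted Series, which guarantees that each label stays paired with its own count:

```python
import pandas as pd

# Toy data (hypothetical locations)
toy = pd.DataFrame({"Location": ["X"] * 4 + ["Y"] * 3 + ["Z"] * 2 + ["W"]})

# Compute the counts once and keep the three most frequent locations
top_3 = toy["Location"].value_counts().sort_values(ascending=False).head(3)

# Labels and values come from the same Series, so they cannot get out of step
labels = top_3.index.tolist()
values = top_3.tolist()

print(labels, values)
```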

**Step 3 : Creating the Donut Chart:**

> wedges, texts, autotexts = ax.pie(values, wedgeprops=dict(width=0.65), startangle=-40, autopct='%.0f%%'): This line generates the donut chart using the ax.pie() function. It takes the values list as input. The wedgeprops dictionary sets the width of the wedges to 0.65, which leaves a hole in the center and turns the pie into a donut. The startangle parameter rotates the chart to start at the specified angle. The autopct parameter displays the percentage of each slice inside the chart; because only the top four values are passed, these percentages are shares of the four plotted locations, not of the whole dataset.

**Label Placement Customization:**

> bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72): This line defines the properties of the label boxes, including their style, padding, fill color, edge color, and line width.

> kw = dict(arrowprops=dict(arrowstyle="-"), bbox=bbox_props, zorder=0, va="center"): This line defines the properties of the arrows connecting the labels to the respective pie slices. It sets the arrow style to a simple line, uses the previously defined bbox_props for the label box, and, as the remaining arguments show, sets zorder=0 and centers the label text vertically with va="center".

In [9]:
'''Ratio of applicants per employee (to gauge how easy or difficult a company is to get into)'''

'''Step 1 :'''
# Group the data by company, location and profession, then calculate the total number of applicants
grouped_df = df.groupby(['Company_Name', 'Location', 'Designation']).agg({
    'Total_applicants': 'sum',
    'Employee_count': 'mean'
}).reset_index()

'''Step 2 :'''
# Calculate the ratio of candidates per employee
grouped_df['Applicants per employee ratio'] = grouped_df['Total_applicants'] / grouped_df['Employee_count']

'''Step 3 :'''
# Sort data by ratio
grouped_df = grouped_df.sort_values(by='Applicants per employee ratio', ascending=False)

'''Step 4 :'''
# Create graphic
plt.figure(figsize=(12, 8))
plt.bar(grouped_df['Company_Name'], grouped_df['Applicants per employee ratio'], color='skyblue')
plt.xlabel('Company')
plt.ylabel('Applicants per employee ratio')
plt.title('Applicants per employee ratio')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Plot graph
plt.show()



# More details and explanations :

**Step 1 : Group the data by company, location and profession, then calculate the total number of applicants**
> This code segment groups the data in the DataFrame df by three columns: Company_Name, Location, and Designation. It then applies aggregation functions to calculate the desired statistics for each group.

* 'Total_applicants': 'sum': This calculates the total number of applicants for each combination of company, location, and designation.

* 'Employee_count': 'mean': This calculates the average number of employees for each combination of company, location, and designation.

* The reset_index() method is applied to the resulting grouped DataFrame to convert the group indices back into regular columns.

**Step 2 : Calculate the ratio of candidates per employee**

>This line calculates the ratio of applicants per employee for each group by dividing the total number of applicants by the average number of employees. This ratio reflects the competition for each position and can be used to identify companies or locations with high or low demand.
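
The grouping, aggregation, and ratio from Steps 1 to 3 can be followed on a few hypothetical rows. The values below are made up for illustration only.

```python
import pandas as pd

# Toy listings (hypothetical): company A has two listings for the same role
toy = pd.DataFrame({
    "Company_Name": ["A", "A", "B"],
    "Location": ["X", "X", "Y"],
    "Designation": ["Engineer", "Engineer", "Analyst"],
    "Total_applicants": [40, 60, 30],
    "Employee_count": [50, 50, 300],
})

# Sum the applicants and average the employee count within each group
grouped = (
    toy.groupby(["Company_Name", "Location", "Designation"])
       .agg({"Total_applicants": "sum", "Employee_count": "mean"})
       .reset_index()
)

# Applicants per employee, sorted from highest to lowest
grouped["Applicants per employee ratio"] = (
    grouped["Total_applicants"] / grouped["Employee_count"]
)
grouped = grouped.sort_values(by="Applicants per employee ratio", ascending=False)
print(grouped)
```

Because the grouping keys include Location and Designation, one company can appear in several rows of the result, each with its own ratio.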

**Step 3 : Sort data by ratio**
> This line sorts the grouped DataFrame by the calculated Applicants per employee ratio in descending order. This arrangement allows for easy identification of companies with the highest ratios.

**Step 4 : Create graphic**
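> A vertical bar chart is drawn with plt.bar(), with the company names on the x-axis (rotated 45 degrees for readability) and the ratio on the y-axis. The call to plt.tight_layout() adjusts the padding so that the rotated labels fit inside the figure.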

In [10]:
'''Number of applicants by industry (to identify the most popular industries)'''

# Calculate total number of applicants per industry
df_industries = df.groupby("Industry")["Total_applicants"].sum()

# Display bar chart
df_industries.plot(kind="bar")

# Add labels to axes
plt.xlabel("Industry")
plt.ylabel("Number of applicants")

# Display graph
plt.show()



In [11]:
'''top 10 of skills which appear the most in the LinkedIn Tech Job Dataset'''

'''step 1'''

# First, print the names of all the columns so the skill columns of interest can be copied and pasted

column_names = df.columns.tolist()
print(column_names)

['Company_Name', 'Class', 'Designation', 'Location', 'Total_applicants', 'LinkedIn_Followers', 'Level', 'Involvement', 'Employee_count', 'Industry', 'PYTHON', 'C++', 'JAVA', 'HADOOP', 'SCALA', 'FLASK', 'PANDAS', 'SPARK', 'NUMPY', 'PHP', 'SQL', 'MYSQL', 'CSS', 'MONGODB', 'NLTK', 'TENSORFLOW', 'LINUX', 'RUBY', 'JAVASCRIPT', 'DJANGO', 'REACT', 'REACTJS', 'AI', 'UI', 'TABLEAU', 'NODEJS', 'EXCEL', 'POWER BI', 'SELENIUM', 'HTML', 'ML']


In [12]:
'''step 2'''

skills = ['PYTHON', 'C++', 'JAVA', 'HADOOP', 'SCALA', 'FLASK', 'PANDAS', 'SPARK', 'NUMPY', 'PHP', 'SQL', 'MYSQL', 'CSS', 'MONGODB', 'NLTK', 'TENSORFLOW', 'LINUX', 'RUBY', 'JAVASCRIPT', 'DJANGO', 'REACT', 'REACTJS', 'AI', 'UI', 'TABLEAU', 'NODEJS', 'EXCEL', 'POWER BI', 'SELENIUM', 'HTML', 'ML']

column_sum = df[skills].sum()
# Calculate the percentage of each skill and sort so the largest shares come first
skill_percentages = ((column_sum / column_sum.sum()) * 100).sort_values(ascending=False)

#print(skill_percentages)

'''step 3'''

# Create a pie chart of the top 10 skill percentages
plt.figure(figsize=(10, 10))
plt.pie(
    skill_percentages.values[:10],  # Considering only the top 10 skills
    labels=skill_percentages.index[:10],
    autopct="%1.1f%%",  # Display percentage values
)

plt.title("Percentage of Top 10 Skills in the LinkedIn Tech Jobs Dataset")

# Create a legend for the pie chart
plt.legend(skill_percentages.index[:10], loc="upper right", bbox_to_anchor=(1.05, 1.05), title="Skills")

plt.show()





# Additional details and explanations :

**Step 1**
> This step prints the names of all the columns in the DataFrame. This is useful because it allows us to see which columns contain the data that we are interested in.

**Step 2 : Calculates the percentage of each skill**

> This step calculates the share of each skill in the LinkedIn Tech Jobs Dataset. df[skills].sum() adds up each skill column, which gives the number of listings that mention the skill if the columns are 0/1 flags. Each column total is then divided, with the / operator, by the sum of all column totals and multiplied by 100, so the percentages express each skill's share of all skill mentions and add up to 100.
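
A short sketch on hypothetical flag columns shows the calculation, including the descending sort that is needed before the first entries can be read as the top skills. The toy columns and values are assumptions made for illustration.

```python
import pandas as pd

# Toy skill flags (hypothetical): 1 means the listing mentions the skill
toy = pd.DataFrame({
    "JAVA":   [0, 1, 0, 0],
    "PYTHON": [1, 1, 1, 0],
    "SQL":    [1, 1, 0, 0],
})

# Number of listings that mention each skill
column_sum = toy.sum()

# Share of all skill mentions, sorted so the most frequent skills come first
skill_percentages = (column_sum / column_sum.sum() * 100).sort_values(ascending=False)

# After sorting, head(k) really is the top k
top_2 = skill_percentages.head(2)
print(top_2)
```

Without the sort, slicing the first entries would simply return the skills in column order, here starting with JAVA.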

**Step 3 : Create a pie chart of the top 10 skill percentages**
> This step draws a pie chart from the first ten entries of skill_percentages (values[:10] and index[:10]). The pie() function creates the chart, and the figsize parameter of plt.figure() sets the size of the figure. The autopct parameter displays percentage values on the slices; these are computed relative to the ten plotted slices. The title() function sets the title of the pie chart, and the legend() function creates a legend for it.

In [13]:
'''Word Cloud of Job Titles in LinkedIn Tech Jobs Dataset'''

'''Step 1'''

!pip install wordcloud

Collecting wordcloud
  Downloading wordcloud-1.9.2-cp39-cp39-win_amd64.whl.metadata (3.4 kB)
Downloading wordcloud-1.9.2-cp39-cp39-win_amd64.whl (153 kB)
   ---------------------------------------- 153.3/153.3 kB 2.3 MB/s eta 0:00:00
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.2


In [14]:
'''step 2'''

from wordcloud import WordCloud
# 'Designation' is the column with the job titles

'''step 3'''
#extracts the job titles
job_df = df['Designation']
all_job_titles = ' '.join(job_df)

'''step 4'''

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_job_titles)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()




# Additional details and explanations :
**step 1 : Install wordcloud**
> The command !pip install wordcloud is used to install the WordCloud package from the Python Package Index (PyPI). WordCloud is a Python package that allows us to generate word clouds, which are visual representations of text data. Word clouds are often used to display the frequency of words in a text dataset.

**step 2 : Importing wordcloud**
>This step imports the WordCloud class from the wordcloud package. It is used in step 4 to generate the word cloud from the job titles.

**step 3 : Extracts the job titles**
>This step extracts the job titles from the LinkedIn Tech Jobs Dataset. The job titles are extracted from the 'Designation' column of the DataFrame and stored in a new variable called 'job_df'. Subsequently, a string containing all of the job titles in the LinkedIn Tech Jobs Dataset is created by joining the job titles together with spaces and stored in a new variable called 'all_job_titles'.
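
To see what a word cloud summarizes, the joined string can be split back into words and counted. The sketch below uses Python's collections.Counter on hypothetical job titles; WordCloud applies its own tokenization and stop-word handling, so its internal counts can differ from this simple split.

```python
from collections import Counter

# Toy job titles (hypothetical)
titles = ["Data Engineer", "Data Scientist", "Software Engineer"]

# Join all titles into one string, as done for the word cloud input
all_job_titles = " ".join(titles)

# Count how often each word occurs; frequent words would be drawn larger
word_counts = Counter(all_job_titles.split())
print(word_counts.most_common(2))
```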

**step 4 : Visualization**
>This step creates a WordCloud object using the imported WordCloud library and passes it the parameters for width, height, and background color. The generate() method is then called on the WordCloud object, passing it the 'all_job_titles' string, which generates a word cloud from the extracted job titles. Next, a figure is created using the figure() function from the matplotlib library with the specified dimensions of 10 inches in width and 5 inches in height. The generated word cloud is then displayed using the imshow() function from the matplotlib library, setting the interpolation parameter to 'bilinear' to improve image quality. Finally, the axis labels are turned off using the axis('off') function from the matplotlib library, and the figure is displayed using the show() function from the matplotlib library.


