## World's Best Employers

### Faigy Mandelbaum
1/18/2023

My goals for this project are the following:
- Write efficient code to help users access the data easily.
- Save time and space while running the code.

Specific goals for the "Worlds Best Employers" dataset:
- Give the user information about employers in each country in an optimized way.
- Organize the different industries.
- Look up the ranks for the companies in the quickest way.

In [18]:
# Reading in the csv as a list of lists
import csv
import pandas as pd
with open ("Worlds Best Employers.csv", encoding = 'UTF-8') as f:
    reader = csv.reader(f)
    rows = list(reader)
header = rows[0]
rows = rows[1:]


In [19]:
print (header)

['RANK', 'NAME', 'INDUSTRIES', 'COUNTRY/TERRITORY', 'EMPLOYEES']


In [20]:
print (rows[:3])

[['1.', 'Samsung Electronics', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'South Kore', '266,673'], ['2.', 'Microsoft', 'IT, Internet, Software & Services', 'United States', '221,000'], ['3.', 'IBM', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'United States', '250,000']]


In [21]:
# Creating a dictionary with the country as a key and the rest of the information as values
employees_by_country = {}
for row in rows:
    if row[3] in employees_by_country:
        employees_by_country[row[3]] += [row[:3] + row[4:]]
    else:
        employees_by_country[row[3]] = [row[:3] + row[4:]]


In [22]:
print (employees_by_country['United States'])

[['2.', 'Microsoft', 'IT, Internet, Software & Services', '221,000'], ['3.', 'IBM', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', '250,000'], ['4.', 'Alphabet', 'IT, Internet, Software & Services', '156,500'], ['5.', 'Apple', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', '154,000'], ['6.', 'Delta Air Lines', 'Transportation and Logistics', '80,000'], ['7.', 'Costco Wholesale', 'Retail and Wholesale', '288,000'], ['8.', 'Adobe', 'IT, Internet, Software & Services', '25,988'], ['9.', 'Southwest Airlines', 'Transportation and Logistics', '55,093'], ['10.', 'Dell Technologies', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', '133,000'], ['11.', 'Lockheed Martin', 'Aerospace & Defense', '114,000'], ['12.', 'Cisco Systems', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', '79,500'], ['14.', 'Amazon', 'IT, Internet, Software & Servi

We use a for loop to build a dictionary whose keys are the countries and whose values are lists holding the remaining fields of every company in that country. The loop runs once over the data. After that, all the information for a country is available through a single key lookup instead of a scan of every row.
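The same grouping can be sketched with `collections.defaultdict`, which removes the `if`/`else` branch. This is a minimal sketch, not the notebook's own code: the helper name `group_by_country` and the `sample_rows` list (three rows copied from the output above) are assumptions made for illustration.

```python
from collections import defaultdict

# Sample rows in the CSV's column order:
# RANK, NAME, INDUSTRIES, COUNTRY/TERRITORY, EMPLOYEES
sample_rows = [
    ['1.', 'Samsung Electronics', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'South Kore', '266,673'],
    ['2.', 'Microsoft', 'IT, Internet, Software & Services', 'United States', '221,000'],
    ['3.', 'IBM', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'United States', '250,000'],
]

def group_by_country(rows, country_index=3):
    """Map each country to its rows, with the country column removed (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        # A missing key starts as an empty list, so no membership check is needed
        grouped[row[country_index]].append(row[:country_index] + row[country_index + 1:])
    return dict(grouped)

by_country = group_by_country(sample_rows)
print(by_country['United States'])
```

Converting back to a plain `dict` at the end means a lookup for a country that is not in the data raises `KeyError` instead of silently creating an empty entry.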

In [23]:
# Creating a set of all industries
industries = set()
for row in rows:
    industries.add(row[2])

In [24]:
industries

{'Aerospace & Defense',
 'Automotive (Automotive and Suppliers)',
 'Banking and Financial Services',
 'Clothing, Shoes, Sports Equipment',
 'Conglomerate',
 'Construction, Oil & Gas Operations, Mining and Chemicals',
 'Drugs & Biotechnology',
 'Engineering, Manufacturing',
 'Food, Soft Beverages, Alcohol & Tobacco',
 'Health Care Equipment & Services',
 'Healthcare & Social',
 'IT, Internet, Software & Services',
 'Media & Advertising',
 'Packaged Goods',
 'Retail and Wholesale',
 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment',
 'Telecommunications Services, Cable Supplier',
 'Transportation and Logistics',
 'Travel & Leisure      ',
 'Utilities'}

We created a set of all industries so that the distinct industry types can be listed and checked easily. A set stores each industry only once, which saves space, and it supports fast membership tests, which saves time.
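The output above contains `'Travel & Leisure      '` with trailing spaces, so an exact membership test for `'Travel & Leisure'` would fail. A minimal sketch of building the set with whitespace stripped follows. The helper name `unique_industries` is an assumption, and the third sample row is a made-up placeholder used only to show the stripping.

```python
# Two rows from the dataset plus one placeholder row (hypothetical, for illustration only)
sample_rows = [
    ['6.', 'Delta Air Lines', 'Transportation and Logistics', 'United States', '80,000'],
    ['9.', 'Southwest Airlines', 'Transportation and Logistics', 'United States', '55,093'],
    ['0.', 'Placeholder Travel Co', 'Travel & Leisure      ', 'United States', '1'],
]

def unique_industries(rows, industry_index=2):
    """Return the distinct industry labels with surrounding whitespace removed."""
    return {row[industry_index].strip() for row in rows}

industries_clean = unique_industries(sample_rows)
print(sorted(industries_clean))
print('Travel & Leisure' in industries_clean)
```

Sorting the set before printing gives a stable, alphabetical listing, since a set itself has no guaranteed order.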

In [25]:
sorted_company_names = []
for row in rows:
    sorted_company_names.append(row[1])
sorted_company_names = sorted(sorted_company_names)    

In [26]:
# Binary search on the sorted company names, then a scan of the rows to return the matching rank
def list_binsearch_lookup(sorted_names, target_name):
    range_start = 0
    range_end = len(sorted_names) - 1
    while range_start < range_end:
        range_middle = (range_end + range_start) // 2
        name = sorted_names[range_middle]
        if name == target_name:
            for row in rows:
                if row[1] == name:
                    return row[0]
        elif name < target_name:
            range_start = range_middle + 1
        else:
            range_end = range_middle - 1
    if sorted_names[range_start] != target_name:
        return False
    else:
        for row in rows:
            if row[1] == target_name:
                return row[0]  

In [27]:
list_binsearch_lookup(sorted_company_names, 'Apple')

'5.'

In [28]:
print (rows[4])

['5.', 'Apple', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'United States', '154,000']
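The function above finds the name by binary search but then scans every row to fetch the rank, so the whole lookup is still linear in the number of rows. One possible alternative, sketched here with the standard library's `bisect` module, is to sort `(name, rank)` pairs together so the rank comes straight from the found position. The names `name_rank_pairs` and `rank_by_name` are assumptions, and returning `None` for a missing name is a choice of this sketch (the notebook's function returns `False`).

```python
from bisect import bisect_left

# Five rows copied from the dataset output shown earlier
sample_rows = [
    ['1.', 'Samsung Electronics', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'South Kore', '266,673'],
    ['2.', 'Microsoft', 'IT, Internet, Software & Services', 'United States', '221,000'],
    ['3.', 'IBM', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'United States', '250,000'],
    ['4.', 'Alphabet', 'IT, Internet, Software & Services', 'United States', '156,500'],
    ['5.', 'Apple', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'United States', '154,000'],
]

# Sort once so that each rank stays attached to its company name
name_rank_pairs = sorted((row[1], row[0]) for row in sample_rows)
sorted_names = [name for name, _ in name_rank_pairs]

def rank_by_name(target_name):
    """Return the rank string for target_name, or None if the name is absent."""
    i = bisect_left(sorted_names, target_name)
    if i < len(sorted_names) and sorted_names[i] == target_name:
        return name_rank_pairs[i][1]
    return None

print(rank_by_name('Apple'))
```

If sorted order is not needed for anything else, a plain dictionary such as `{row[1]: row[0] for row in rows}` gives the same answer with an average constant-time lookup.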


In [29]:

employees_df = pd.DataFrame(rows, columns = header)

In [30]:
employees_df.head()

|   | RANK | NAME | INDUSTRIES | COUNTRY/TERRITORY | EMPLOYEES |
|---|---|---|---|---|---|
| 0 | 1.0 | Samsung Electronics | Semiconductors, Electronics, Electrical Engine... | South Kore | 266673 |
| 1 | 2.0 | Microsoft | IT, Internet, Software & Services | United States | 221000 |
| 2 | 3.0 | IBM | Semiconductors, Electronics, Electrical Engine... | United States | 250000 |
| 3 | 4.0 | Alphabet | IT, Internet, Software & Services | United States | 156500 |
| 4 | 5.0 | Apple | Semiconductors, Electronics, Electrical Engine... | United States | 154000 |


In [31]:
# Binary search on the sorted company names, with the rank read from a pandas dataframe
def dataframe_binsearch_lookup(df, sorted_names, target_name):
    range_start = 0
    range_end = len(sorted_names) - 1
    while range_start < range_end:
        range_middle = (range_end + range_start) // 2
        name = sorted_names[range_middle]
        if name == target_name:
            return df['RANK'][df['NAME'] == target_name]
        elif name < target_name:
            range_start = range_middle + 1
        else:
            range_end = range_middle - 1
    if sorted_names[range_start] != target_name:
        return False
    else:
        return df['RANK'][df['NAME'] == target_name]  

In [32]:
dataframe_binsearch_lookup(employees_df, sorted_company_names, 'IBM')

2    3.
Name: RANK, dtype: object

In [33]:
import time

In [34]:
start=time.time()
dataframe_binsearch_lookup(employees_df, sorted_company_names, 'IBM')
end=time.time()
dataframe_time=end-start

start=time.time()
list_binsearch_lookup(sorted_company_names, 'IBM')
end=time.time()
list_time=end-start
print(dataframe_time, list_time)

0.0 0.0


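After running and timing the code above, both lookups print 0.0: each call finishes faster than `time.time()` can resolve here, so this single measurement cannot show which one is faster. The list version is likely the faster of the two on a dataset this small, because the data frame version builds a boolean mask over the whole `NAME` column on every call, but confirming that needs repeated runs with a finer timer. (The data frame function is, however, the shorter one.)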
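A single `time.time()` measurement is too coarse for calls this short. A minimal sketch of a more reliable measurement, using the standard library's `timeit` to average over many repetitions, could look like the following. The helpers `average_seconds` and `linear_rank_lookup` and the three-row `sample_rows` list are assumptions for illustration, and no timing figures are claimed here because they depend on the machine.

```python
import timeit

# Three rows copied from the dataset output shown earlier
sample_rows = [
    ['1.', 'Samsung Electronics', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'South Kore', '266,673'],
    ['2.', 'Microsoft', 'IT, Internet, Software & Services', 'United States', '221,000'],
    ['3.', 'IBM', 'Semiconductors, Electronics, Electrical Engineering, Technology Hardware & Equipment', 'United States', '250,000'],
]

def linear_rank_lookup(rows, target_name):
    """Stand-in lookup to be timed: scan the rows and return the matching rank."""
    for row in rows:
        if row[1] == target_name:
            return row[0]
    return None

def average_seconds(func, *args, number=10_000):
    """Average wall-clock seconds per call of func(*args) over `number` repetitions."""
    total = timeit.timeit(lambda: func(*args), number=number)
    return total / number

list_time = average_seconds(linear_rank_lookup, sample_rows, 'IBM')
print(f"{list_time:.3e} seconds per call")
```

Any of the notebook's lookup functions can be passed to `average_seconds` in the same way, for example `average_seconds(dataframe_binsearch_lookup, employees_df, sorted_company_names, 'IBM')`, so that both versions are compared over the same number of repetitions.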

In conclusion:
- The user can now get information about the employers in a country just by typing in the country name.
- The user can also find the rank of a company by giving its name.
- There is a clear list of all industries.

### All functions in a class:

In [35]:
import csv
import pandas as pd

class Employee:

    def __init__(self, employee_information):
        # Path to the csv file; every other attribute is built from it below,
        # so the class does not depend on variables defined in earlier cells.
        self.employee_path = employee_information
        self.rows = []
        self.header = []
        self.employees_by_country = {}
        self.industries = set()
        self.sorted_company_names = []
        self.read_csv()
        self.create_employees_by_country_dict()
        self.create_industries_set()
        self.sort_companies()
        self.employees_df = pd.DataFrame(self.rows, columns = self.header)

    # A function that reads in a csv file as a list of lists
    def read_csv(self):
        with open (self.employee_path, encoding = 'UTF-8') as f:
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]
        self.rows = rows[1:]

    # Building a dictionary with the countries as keys and the rest of the information as values.
    def create_employees_by_country_dict(self):
        # Start from an empty dictionary so a repeated call does not duplicate entries
        self.employees_by_country = {}
        for row in self.rows:
            if row[3] in self.employees_by_country:
                self.employees_by_country[row[3]] += [row[:3] + row[4:]]
            else:
                self.employees_by_country[row[3]] = [row[:3] + row[4:]]

    # Creating a set of industries to enable customers to see the different types in an organized list.
    def create_industries_set(self):
        # Store the result on the instance and read from self.rows, not the global rows
        self.industries = set()
        for row in self.rows:
            self.industries.add(row[2])

    # Sorting the company names in a list
    def sort_companies(self):
        self.sorted_company_names = []
        for row in self.rows:
            self.sorted_company_names.append(row[1])
        self.sorted_company_names = sorted(self.sorted_company_names) 

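    # This function does a binary search on the sorted company names and returns the rank of the company that was given.
    # The argument order matches dataframe_binsearch_lookup: sorted names first, then the target name.
    def list_binsearch_lookup(self, sorted_names, target_name):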
        range_start = 0
        range_end = len(sorted_names) - 1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2
            name = sorted_names[range_middle]
            if name == target_name:
                for row in self.rows:
                    if row[1] == name:
                        return row[0]
            elif name < target_name:
                range_start = range_middle + 1
            else:
                range_end = range_middle - 1
        if sorted_names[range_start] != target_name:
            return False
        else:
            for row in self.rows:
                if row[1] == target_name:
                    return row[0]

    # This function does the same as above (binary search on the company names) but reads the rank from a pandas dataframe.
    def dataframe_binsearch_lookup(self, df, sorted_names, target_name):
        range_start = 0
        range_end = len(sorted_names) - 1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2
            name = sorted_names[range_middle]
            if name == target_name:
                return df['RANK'][df['NAME'] == target_name]
            elif name < target_name:
                range_start = range_middle + 1
            else:
                range_end = range_middle - 1
        if sorted_names[range_start] != target_name:
            return False
        else:
            return df['RANK'][df['NAME'] == target_name] 