# Module 02 (Data Preprocessing)

## Pandas
Pandas is a Python library for data manipulation and analysis. It provides easy-to-use data structures and functions for working with structured data, such as tables, and offers tools for cleaning, transforming, and analyzing data.

The key data structures in Pandas are:

* **Series**: A one-dimensional labeled array that can hold data of any type. It is similar to a single column of a spreadsheet or of a DataFrame.
* **DataFrame**: A two-dimensional labeled data structure whose columns can have different data types. It is similar to a spreadsheet or a table in a relational database.
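
As a minimal sketch of the two structures described above, the snippet below builds a small Series and a small DataFrame by hand. The variable names (`revenues`, `companies`, `rank_column`) are illustrative assumptions, and the three rows are copied from the first rows of the dataset used later in this module.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels.
revenues = pd.Series(
    [485873, 315199, 267518],
    index=["Walmart", "State Grid", "Sinopec Group"],
    name="revenues",
)

# A DataFrame: two-dimensional, and each column can have its own data type.
companies = pd.DataFrame({
    "company": ["Walmart", "State Grid", "Sinopec Group"],
    "rank": [1, 2, 3],
    "revenue_change": [0.8, -4.4, -9.1],
})

# Selecting a single column of a DataFrame yields a Series.
rank_column = companies["rank"]

print(type(rank_column).__name__)  # Series
print(revenues["State Grid"])      # 315199
print(companies.shape)             # (3, 3)
```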

The line "import pandas as pd" imports the pandas library and assigns it the alias "pd". You can then refer to pandas functions and objects with the short "pd" prefix instead of typing the full library name.

In [1]:
import pandas as pd

The line of code "f500 = pd.read_csv('f500.csv')" reads the data from a file called 'f500.csv' into a DataFrame and stores it in a variable named "f500". The data describes the companies in the Fortune Global 500 ranking, one row per company. With the data loaded, we can access and analyze details such as revenues, profits, sector, and country.

In [2]:
f500=pd.read_csv('f500.csv')
f500

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006


The code "f500.head()" displays the first rows of the "f500" dataframe (five by default; pass a number, such as "f500.head(10)", to change this). It gives a quick preview of the top records.

In [3]:
f500.head()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210


The code "f500.tail()" displays the last rows of the "f500" dataframe (five by default). It gives a quick preview of the records at the end of the dataframe.

In [4]:
f500.tail()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
495,Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
499,AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310


The code "f500.columns" is used to retrieve the column labels or headers of the "f500" dataframe.

In [5]:
f500.columns

Index(['company', 'rank', 'revenues', 'revenue_change', 'profits', 'assets',
       'profit_change', 'ceo', 'industry', 'sector', 'previous_rank',
       'country', 'hq_location', 'website', 'years_on_global_500_list',
       'employees', 'total_stockholder_equity'],
      dtype='object')

The output of "f500.info()" includes the following information:

* The total number of entries (rows) in the dataframe.
* The data type of each column, such as integer, float, or object.
* The number of non-null values in each column, which indicates if there are any missing values.
* The memory usage of the dataframe.

This summary helps in understanding the overall structure of the dataframe and assessing the data quality, such as identifying missing values or determining appropriate data types for analysis.

In [6]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   company                   500 non-null    object 
 1   rank                      500 non-null    int64  
 2   revenues                  500 non-null    int64  
 3   revenue_change            498 non-null    float64
 4   profits                   499 non-null    float64
 5   assets                    500 non-null    int64  
 6   profit_change             436 non-null    float64
 7   ceo                       500 non-null    object 
 8   industry                  500 non-null    object 
 9   sector                    500 non-null    object 
 10  previous_rank             500 non-null    int64  
 11  country                   500 non-null    object 
 12  hq_location               500 non-null    object 
 13  website                   500 non-null    object 
 14  years_on_g
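
The non-null counts that "info()" reports can be cross-checked by counting missing values directly. The sketch below assumes a tiny hand-built frame called `sample` (three rows taken from the dataset, two of which have no "profit_change" value) and uses the `isna()` and `notna()` methods.

```python
import pandas as pd

# Three rows copied from the dataset; two have a missing "profit_change".
sample = pd.DataFrame({
    "company": ["Volkswagen", "Dell Technologies", "Apple"],
    "profits": [5937.3, -1672.0, 45687.0],
    "profit_change": [float("nan"), float("nan"), -14.4],
})

# isna() marks missing cells as True; summing counts them per column.
missing_per_column = sample.isna().sum()

# notna() is the complement and matches the "Non-Null Count" of info().
non_null_per_column = sample.notna().sum()

print(missing_per_column["profit_change"])   # 2
print(non_null_per_column["profit_change"])  # 1
```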

The code "f500.describe(include='all')" is used to generate descriptive statistics for the "f500" dataframe, including both numeric and non-numeric columns. By using the parameter "include='all'", it ensures that statistics for all columns, regardless of their data type, are included in the output.

The output of "f500.describe(include='all')" provides the following statistics for each column. The rows "unique", "top", and "freq" are filled in only for non-numeric columns, and the rows from "mean" to "max" only for numeric columns; entries that do not apply are left empty (NaN):
- count: The number of non-null values in the column.
- unique: The number of distinct values in the column.
- top: The most frequently occurring value in the column.
- freq: The number of times the most frequent value occurs.
- mean: The arithmetic mean (average) of the values in the column.
- std: The standard deviation of the values in the column.
- min: The minimum value in the column.
- 25%, 50%, 75%: The quartiles of the values in the column.
- max: The maximum value in the column.

These statistics help in understanding the distribution, central tendency, and spread of the numeric columns, while providing insights into the unique values and frequency of occurrence in non-numeric columns. It is useful for initial data exploration and gaining a summary view of the dataset.

In [7]:
f500.describe(include='all')

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
count,500,500.0,500.0,498.0,499.0,500.0,436.0,500,500,500,500.0,500,500,500,500.0,500.0,500.0
unique,500,,,,,,,500,58,21,,34,235,500,,,
top,Walmart,,,,,,,C. Douglas McMillon,Banks: Commercial and Savings,Financials,,USA,"Beijing, China",http://www.walmart.com,,,
freq,1,,,,,,,1,51,118,,132,56,1,,,
mean,,250.5,55416.358,4.538353,3055.203206,243632.3,24.152752,,,,222.134,,,,15.036,133998.3,30628.076
std,,144.481833,45725.478963,28.549067,5171.981071,485193.7,437.509566,,,,146.941961,,,,7.932752,170087.8,43642.576833
min,,1.0,21609.0,-67.3,-13038.0,3717.0,-793.7,,,,0.0,,,,1.0,328.0,-59909.0
25%,,125.75,29003.0,-5.9,556.95,36588.5,-22.775,,,,92.75,,,,7.0,42932.5,7553.75
50%,,250.5,40236.0,0.55,1761.6,73261.5,-0.35,,,,219.5,,,,17.0,92910.5,15809.5
75%,,375.25,63926.75,6.975,3954.0,180564.0,17.7,,,,347.25,,,,23.0,168917.2,37828.5


The code "f500.dtypes" is used to display the data types of each column in the "f500" dataframe.

In [8]:
f500.dtypes

company                      object
rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object

In [9]:
f500['company'] # access the "company" column of the "f500" dataframe; the result is a Series

0                             Walmart
1                          State Grid
2                       Sinopec Group
3            China National Petroleum
4                        Toyota Motor
                    ...              
495    Teva Pharmaceutical Industries
496          New China Life Insurance
497         Wm. Morrison Supermarkets
498                               TUI
499                        AutoNation
Name: company, Length: 500, dtype: object

In [10]:
f500[['company','ceo']] # access the "company" and "ceo" columns of the "f500" dataframe; the result is a DataFrame

Unnamed: 0,company,ceo
0,Walmart,C. Douglas McMillon
1,State Grid,Kou Wei
2,Sinopec Group,Wang Yupu
3,China National Petroleum,Zhang Jianhua
4,Toyota Motor,Akio Toyoda
...,...,...
495,Teva Pharmaceutical Industries,Yitzhak Peterburg
496,New China Life Insurance,Wan Feng
497,Wm. Morrison Supermarkets,David T. Potts
498,TUI,Friedrich Joussen


In [11]:
f500.drop(1) # returns a new dataframe without the row labeled 1; f500 itself is not modified

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
5,Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
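
A short sketch of how "drop" behaves, assuming a small illustrative frame named `sample`: by default the method returns a new DataFrame and leaves the original untouched, and columns can be removed with the `columns` keyword.

```python
import pandas as pd

sample = pd.DataFrame({
    "company": ["Walmart", "State Grid", "Sinopec Group"],
    "rank": [1, 2, 3],
})

# drop() returns a new DataFrame; the row labeled 1 is gone from the copy only.
without_row_1 = sample.drop(1)
print(len(sample), len(without_row_1))  # 3 2

# Columns are dropped by name through the `columns` keyword.
without_rank = sample.drop(columns="rank")
print(list(without_rank.columns))       # ['company']
```

To keep the change, assign the result back to a variable, for example `sample = sample.drop(1)`.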


The code "f500['country'].value_counts()" is used to count the occurrences of each unique value in the 'country' column of the 'f500' dataframe.

In [12]:
f500["country"].value_counts()

USA             132
China           109
Japan            51
Germany          29
France           29
Britain          24
South Korea      15
Netherlands      14
Switzerland      14
Canada           11
Spain             9
Australia         7
Brazil            7
India             7
Italy             7
Taiwan            6
Russia            4
Ireland           4
Singapore         3
Sweden            3
Mexico            2
Malaysia          1
Thailand          1
Belgium           1
Norway            1
Luxembourg        1
Indonesia         1
Denmark           1
Saudi Arabia      1
Finland           1
Venezuela         1
Turkey            1
U.A.E             1
Israel            1
Name: country, dtype: int64
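
If shares are more useful than raw counts, "value_counts" also accepts `normalize=True`. The sketch below uses a short made-up Series named `countries_sample` (an assumption for illustration, not the real column).

```python
import pandas as pd

# Illustrative values only; the real column has 500 entries.
countries_sample = pd.Series(
    ["USA", "China", "USA", "Japan", "USA", "China"], name="country"
)

counts = countries_sample.value_counts()                # sorted, most frequent first
shares = countries_sample.value_counts(normalize=True)  # fractions that sum to 1

print(counts.index[0], counts.iloc[0])  # USA 3
print(shares["USA"])                    # 0.5
```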

## iloc and loc

In pandas, iloc is an indexer for selecting data from a DataFrame by integer position. It accesses rows and columns by their position (0, 1, 2, ...) rather than by their labels.

The syntax for using iloc is as follows:

    df.iloc[row_indexer, column_indexer]


In [13]:
f500.iloc[4] # selects the row at integer position 4 (the fifth row) and returns it as a Series

company                                     Toyota Motor
rank                                                   5
revenues                                          254694
revenue_change                                       7.7
profits                                          16899.3
assets                                            437575
profit_change                                      -12.3
ceo                                          Akio Toyoda
industry                        Motor Vehicles and Parts
sector                            Motor Vehicles & Parts
previous_rank                                          8
country                                            Japan
hq_location                                Toyota, Japan
website                     http://www.toyota-global.com
years_on_global_500_list                              23
employees                                         364445
total_stockholder_equity                          157210
Name: 4, dtype: object

In [14]:
f500.iloc[4,-4] # selects the value in the 5th row and the 4th column from the end ('website')

'http://www.toyota-global.com'

In [15]:
f500.iloc[[5,13],[0,4]] # retrieves the values in the 1st and 5th columns for the 6th and 14th companies in the DataFrame. 

Unnamed: 0,company,profits
5,Volkswagen,5937.3
13,CVS Health,5317.0


The **loc** indexer in pandas is used to access rows and columns in a DataFrame using label-based indexing.

With loc, you can specify the row labels and column labels to select specific data from a DataFrame.

For example, f500.loc[4, 'company'] retrieves the value in the 'company' column for the row with label 4. It selects a specific cell based on the label of the row and the label of the column.

In [16]:
f500.loc[4,"company"] # retrieves the value in the 'company' column for the row with label 4.

'Toyota Motor'

In [17]:
f500.loc[4,['company','ceo']] # retrieves the values in the 'company' and 'ceo' columns for the row with label 4 in the DataFrame

company    Toyota Motor
ceo         Akio Toyoda
Name: 4, dtype: object

In [18]:
f500.loc[:4,'company':'ceo'] # retrieves rows up to index 4 (inclusive) and columns from 'company' to 'ceo' (inclusive).

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda
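
With the default index, row labels and row positions coincide, so loc and iloc can look interchangeable. The sketch below, which assumes a small hand-built frame `germany` that keeps the row labels 5, 65, and 75 from the larger table, shows where the two differ: after sorting, and when slicing.

```python
import pandas as pd

germany = pd.DataFrame(
    {
        "company": ["Volkswagen", "Siemens", "Robert Bosch"],
        "employees": [626715, 351000, 389281],
    },
    index=[5, 65, 75],  # row labels kept from the larger table
)

by_employees = germany.sort_values("employees", ascending=False)

# iloc counts positions in the current row order.
second_by_position = by_employees.iloc[1, 0]
# loc looks up the label, wherever that row now sits.
by_label = by_employees.loc[65, "company"]

print(second_by_position)  # Robert Bosch
print(by_label)            # Siemens

# Label slices include both endpoints; positional slices exclude the stop.
print(len(germany.loc[5:65]))  # 2
print(len(germany.iloc[0:1]))  # 1
```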


## Boolean indexing
Boolean indexing allows you to select rows or columns based on a condition or set of conditions. It involves using a boolean expression to filter the data. For example, df[condition] returns a DataFrame where only the rows that satisfy the specified condition are included.

The code india_bool = f500['country'] == 'India' creates a boolean series india_bool that checks whether the values in the 'country' column of the DataFrame f500 are equal to 'India'. Each element of the boolean series will be True if the corresponding row has 'India' in the 'country' column, and False otherwise.

In [19]:
india_bool=f500['country']=='India'
f500[india_bool]

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
167,Indian Oil,168,53562,-2.1,2960.0,42132,72.7,Sanjiv Singh,Petroleum Refining,Energy,161,India,"New Delhi, India",http://www.iocl.com,23,34999,15724
202,Reliance Industries,203,46931,8.0,4458.9,108856,5.6,Mukesh D. Ambani,Petroleum Refining,Energy,215,India,"Mumbai, India",http://www.ril.com,14,140483,40614
216,State Bank of India,217,44533,6.8,36.0,530590,-98.1,Arundhati Bhattacharya,Banks: Commercial and Savings,Financials,232,India,"Mumbai, India",http://www.sbi.co.in,12,278872,33450
246,Tata Motors,247,40329,-4.2,1111.6,42162,-34.0,Guenter Butschek,Motor Vehicles and Parts,Motor Vehicles & Parts,226,India,"Mumbai, India",http://www.tatamotors.com,8,79558,8942
294,Rajesh Exports,295,36114,43.1,185.8,3717,13.8,Rajesh J. Mehta,Trading,Wholesalers,423,India,"Bengaluru, India",http://www.rajeshindia.com,2,328,868
359,Bharat Petroleum,360,30316,4.2,1300.5,16801,6.7,D. Rajkumar,Petroleum Refining,Energy,358,India,"Mumbai, India",http://www.bharatpetroleum.in,14,13395,4747
383,Hindustan Petroleum,384,28166,-2.3,1228.1,12370,63.4,Mukesh Kumar Surana,Petroleum Refining,Energy,367,India,"Mumbai, India",http://www.hindustanpetroleum.com,14,10422,3245


The code "f500.loc[india_bool, ['company', 'ceo']]" returns a DataFrame that includes the 'company' and 'ceo' columns for the rows where the 'country' is 'India' in the original DataFrame f500.

In [20]:
f500.loc[india_bool,['company','ceo']]

Unnamed: 0,company,ceo
167,Indian Oil,Sanjiv Singh
202,Reliance Industries,Mukesh D. Ambani
216,State Bank of India,Arundhati Bhattacharya
246,Tata Motors,Guenter Butschek
294,Rajesh Exports,Rajesh J. Mehta
359,Bharat Petroleum,D. Rajkumar
383,Hindustan Petroleum,Mukesh Kumar Surana


## Logical Operators in Pandas
In pandas, you can use the logical operators & (AND), | (OR), and ~ (NOT) to combine or negate boolean conditions.

* The & operator performs element-wise AND between two or more boolean conditions. It returns True only when both conditions are True.
* The | operator performs element-wise OR between two or more boolean conditions. It returns True if at least one of the conditions is True.
* The ~ operator negates a boolean condition. It returns the opposite of the condition.
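
The three operators can be sketched on a small frame as follows. The frame `sample` and the mask names are assumptions for illustration; the four rows are taken from the dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "company": ["Apple", "Walmart", "Toyota Motor", "Sinopec Group"],
    "country": ["USA", "USA", "Japan", "China"],
    "sector": ["Technology", "Retailing", "Motor Vehicles & Parts", "Energy"],
})

is_usa = sample["country"] == "USA"
is_tech = sample["sector"] == "Technology"
is_energy = sample["sector"] == "Energy"

usa_and_tech = sample[is_usa & is_tech]     # both conditions hold
usa_or_energy = sample[is_usa | is_energy]  # at least one condition holds
not_usa = sample[~is_usa]                   # condition negated

print(usa_and_tech["company"].tolist())   # ['Apple']
print(usa_or_energy["company"].tolist())  # ['Apple', 'Walmart', 'Sinopec Group']
print(not_usa["company"].tolist())        # ['Toyota Motor', 'Sinopec Group']
```

When the comparisons are written inline instead of being stored in variables, each one needs its own parentheses, for example `sample[(sample["country"] == "USA") | (sample["sector"] == "Energy")]`.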

The expression f500[(f500['country']=='USA') & (f500['sector']=='Technology')] combines two comparisons (==) with the logical operator & to build a boolean condition that keeps only the rows of f500 where the country is 'USA' and the sector is 'Technology'. Each comparison must be wrapped in parentheses, because & binds more tightly than == in Python.

In [21]:
f500[(f500['country']=='USA') & (f500['sector']=='Technology')]

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
8,Apple,9,215639,-7.7,45687.0,321686,-14.4,Timothy D. Cook,"Computers, Office Equipment",Technology,9,USA,"Cupertino, CA",http://www.apple.com,15,116000,128249
25,Amazon.com,26,135987,27.1,2371.0,83402,297.8,Jeffrey P. Bezos,Internet Services and Retailing,Technology,44,USA,"Seattle, WA",http://www.amazon.com,9,341400,19285
64,Alphabet,65,90272,20.4,19478.0,167497,19.1,Larry Page,Internet Services and Retailing,Technology,94,USA,"Mountain View, CA",http://www.abc.xyz,9,72053,139036
68,Microsoft,69,85320,-8.8,16798.0,193694,37.8,Satya Nadella,Computer Software,Technology,63,USA,"Redmond, WA",http://www.microsoft.com,20,114000,71997
80,IBM,81,79919,-3.1,11872.0,117470,-10.0,Virginia M. Rometty,Information Technology Services,Technology,82,USA,"Armonk, NY",http://www.ibm.com,23,414400,18246
123,Dell Technologies,124,64806,18.1,-1672.0,118206,,Michael S. Dell,"Computers, Office Equipment",Technology,0,USA,"Round Rock, TX",http://www.delltechnologies.com,17,138000,13243
143,Intel,144,59387,7.3,10316.0,113327,-9.7,Brian M. Krzanich,Semiconductors and Other Electronic Components,Technology,158,USA,"Santa Clara, CA",http://www.intel.com,23,106000,66226
180,Hewlett Packard Enterprise,181,50123,,3161.0,79679,,Margaret C. Whitman,Information Technology Services,Technology,0,USA,"Palo Alto, CA",http://www.hpe.com,1,195000,31448
186,Cisco Systems,187,49247,0.2,10739.0,121652,19.6,Charles H. Robbins,Network and Other Communications Equipment,Technology,183,USA,"San Jose, CA",http://www.cisco.com,18,73700,63586
193,HP,194,48238,-53.3,2496.0,29010,-45.2,Dion J. Weisler,"Computers, Office Equipment",Technology,48,USA,"Palo Alto, CA",http://www.hp.com,23,49000,-3889


The expression f500[(f500['country']=='USA') & ~(f500['sector']=='Technology')] combines two comparisons (==) with the logical operators & and ~ to keep only the rows of f500 where the country is 'USA' and the sector is not 'Technology'.

In [22]:
f500[(f500["country"] == "USA") & ~(f500["sector"] == "Technology")]

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
7,Berkshire Hathaway,8,223604,6.1,24074.0,620854,,Warren E. Buffett,Insurance: Property and Casualty (Stock),Financials,11,USA,"Omaha, NE",http://www.berkshirehathaway.com,21,367700,283001
9,Exxon Mobil,10,205004,-16.7,7840.0,330314,-51.5,Darren W. Woods,Petroleum Refining,Energy,6,USA,"Irving, TX",http://www.exxonmobil.com,23,72700,167325
10,McKesson,11,198533,3.1,5070.0,60969,124.5,John H. Hammergren,Wholesalers: Health Care,Wholesalers,12,USA,"San Francisco, CA",http://www.mckesson.com,23,64500,11095
12,UnitedHealth Group,13,184840,17.7,7017.0,122810,20.7,Stephen J. Hemsley,Health Care: Insurance and Managed Care,Health Care,17,USA,"Minnetonka, MN",http://www.unitedhealthgroup.com,21,230000,38274
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
480,U.S. Bancorp,481,22744,5.8,5888.0,445964,0.2,Andrew J. Cecere,Banks: Commercial and Savings,Financials,490,USA,"Minneapolis, MN",http://www.usbank.com,12,71191,47298
482,Aflac,483,22559,8.1,2659.0,129819,5.0,Daniel P. Amos,"Insurance: Life, Health (stock)",Financials,0,USA,"Columbus, GA",http://www.aflac.com,10,10212,20482
488,Sears Holdings,489,22138,-12.0,-2221.0,9362,,Edward S. Lampert,General Merchandisers,Retailing,425,USA,"Hoffman Estates, IL",http://www.searsholdings.com,23,140000,-3824
491,Dollar General,492,21987,7.9,1251.1,11672,7.4,Todd J. Vasos,Specialty Retailers,Retailing,0,USA,"Goodlettsville, TN",http://www.dollargeneral.com,1,121000,5406


In [23]:
# keep only the rows whose "country" value is "Germany"
germany_df = f500[f500["country"] == "Germany"]
germany_df.head()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
5,Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753
16,Daimler,17,169483,2.2,9428.4,256262,0.9,Dieter Zetsche,Motor Vehicles and Parts,Motor Vehicles & Parts,16,Germany,"Stuttgart, Germany",http://www.daimler.com,23,282488,61116
33,Allianz,34,122196,-0.6,7611.5,932091,3.7,Oliver Bate,Insurance: Property and Casualty (Stock),Financials,34,Germany,"Munich, Germany",http://www.allianz.com,23,140253,71020
51,BMW Group,52,104130,1.8,7589.4,198835,7.4,Harald Kruger,Motor Vehicles and Parts,Motor Vehicles & Parts,51,Germany,"Munich, Germany",http://www.bmwgroup.com,23,124729,49682
65,Siemens,66,88419,0.9,6050.5,141271,-27.4,Josef Kaeser,Industrial Machinery,Industrials,71,Germany,"Munich, Germany",http://www.siemens.com,23,351000,38444


The code written below creates a new DataFrame germany_df_sort_employees by sorting the germany_df DataFrame based on the values in the "employees" column in descending order.

In [24]:
germany_df_sort_employees = germany_df.sort_values("employees", ascending=False)
germany_df_sort_employees.head()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
5,Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753
116,Deutsche Post DHL Group,117,65787,-3.8,2918.3,40387,70.8,Frank Appel,"Mail, Package, and Freight Delivery",Transportation,108,Germany,"Bonn, Germany",http://www.dpdhl.com,23,459262,11693
75,Robert Bosch,76,80869,3.3,2155.3,86348,-39.1,Volkmar Denner,Motor Vehicles and Parts,Motor Vehicles & Parts,87,Germany,"Stuttgart, Germany",http://www.bosch.com,23,389281,36316
308,Edeka Zentrale,309,34193,6.8,356.0,6921,28.3,Markus Mosa,Wholesalers: Food and Grocery,Wholesalers,321,Germany,"Hamburg, Germany",http://www.edeka-verbund.de,19,351500,1729
65,Siemens,66,88419,0.9,6050.5,141271,-27.4,Josef Kaeser,Industrial Machinery,Industrials,71,Germany,"Munich, Germany",http://www.siemens.com,23,351000,38444


In [25]:
# retrieves the value located at the first row and the first column of the germany_df_sort_employees DataFrame.
germany_df_sort_employees.iloc[0,0] 

'Volkswagen'

#### Companies with the most employees in each country

In [26]:
countries = f500["country"].unique() # retrieves the unique values from the "country" column
countries

array(['USA', 'China', 'Japan', 'Germany', 'Netherlands', 'Britain',
       'South Korea', 'Switzerland', 'France', 'Taiwan', 'Singapore',
       'Italy', 'Russia', 'Spain', 'Brazil', 'Mexico', 'Luxembourg',
       'India', 'Malaysia', 'Thailand', 'Australia', 'Belgium', 'Norway',
       'Canada', 'Ireland', 'Indonesia', 'Denmark', 'Saudi Arabia',
       'Sweden', 'Finland', 'Venezuela', 'Turkey', 'U.A.E', 'Israel'],
      dtype=object)

The code below finds the company with the highest number of employees for each country in the f500 DataFrame and stores the result in the top_employer_by_country dictionary. Here is a step-by-step explanation:
1. Initialize an empty dictionary called top_employer_by_country.
2. Iterate over each unique country name in the countries array.
3. For each country, create a subset DataFrame called country_df that contains only the rows where the "country" column matches the current country name.
4. Sort the country_df DataFrame based on the "employees" column in descending order using the sort_values() method, with ascending=False. This sorts the DataFrame in a way that the companies with the highest number of employees appear at the top.
5. Retrieve the value from the first row and the first column of the sorted country_df DataFrame using iloc[0, 0]. This corresponds to the company name with the highest number of employees in that country.
6. Assign the company name to the top_employer_by_country dictionary, using the country name as the key.
7. Repeat steps 3-6 for each unique country name in the countries array.
8. After the loop completes, the top_employer_by_country dictionary will contain the company names with the highest number of employees for each country.

In [27]:
top_employer_by_country = {}

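for country in countries:
    # keep only the rows for the current country
    country_df = f500[f500["country"] == country]
    # sort so that the largest employer comes first
    country_df_sort_employees = country_df.sort_values("employees", ascending=False)
    # first row, first column ('company') holds the top employer's name
    top_employer_by_country[country] = country_df_sort_employees.iloc[0, 0]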

In [28]:
# display the dictionary: each key is a country and each value is the company with the most employees in that country
top_employer_by_country 

{'USA': 'Walmart',
 'China': 'China National Petroleum',
 'Japan': 'Toyota Motor',
 'Germany': 'Volkswagen',
 'Netherlands': 'EXOR Group',
 'Britain': 'Compass Group',
 'South Korea': 'Samsung Electronics',
 'Switzerland': 'Nestle',
 'France': 'Sodexo',
 'Taiwan': 'Hon Hai Precision Industry',
 'Singapore': 'Flex',
 'Italy': 'Poste Italiane',
 'Russia': 'Gazprom',
 'Spain': 'Banco Santander',
 'Brazil': 'JBS',
 'Mexico': 'America Movil',
 'Luxembourg': 'ArcelorMittal',
 'India': 'State Bank of India',
 'Malaysia': 'Petronas',
 'Thailand': 'PTT',
 'Australia': 'Wesfarmers',
 'Belgium': 'Anheuser-Busch InBev',
 'Norway': 'Statoil',
 'Canada': 'George Weston',
 'Ireland': 'Accenture',
 'Indonesia': 'Pertamina',
 'Denmark': 'Maersk Group',
 'Saudi Arabia': 'SABIC',
 'Sweden': 'H & M Hennes & Mauritz',
 'Finland': 'Nokia',
 'Venezuela': 'Mercantil Servicios Financieros',
 'Turkey': 'Koc Holding',
 'U.A.E': 'Emirates Group',
 'Israel': 'Teva Pharmaceutical Industries'}
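
The loop above is the approach used in this module. As an alternative sketch (a different technique, not the code that produced the output above), the same kind of dictionary can be built without an explicit loop by combining "groupby" with "idxmax". The frame `sample` below is a hand-built assumption with five rows taken from the dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "company": ["Volkswagen", "Siemens", "Walmart", "Apple", "Toyota Motor"],
    "country": ["Germany", "Germany", "USA", "USA", "Japan"],
    "employees": [626715, 351000, 2300000, 116000, 364445],
})

# For each country, idxmax returns the row label of the largest "employees" value.
top_rows = sample.groupby("country")["employees"].idxmax()

# Look those rows up by label and map country -> company.
top_employer = (
    sample.loc[top_rows.tolist()]
    .set_index("country")["company"]
    .to_dict()
)

print(top_employer["Germany"])  # Volkswagen
print(top_employer["USA"])      # Walmart
```

On the full data this would be applied to f500 instead of `sample`; it avoids sorting each country subset separately.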