# Python Crash Course - Exercise 08

Topics covered:
* pandas
* binary search

Tasks:
* Task 1: Cities, Countries, Languages with pandas
* Task 2: Implementing your own binary search algorithm

In [1]:
# import modules/packages
import pandas as pd

# Task 1: Cities, Countries, Languages with pandas

For this task, we'll be working with the [Kaggle](https://www.kaggle.com/datasets/adilshamim8/world-cities-countries-and-languages-dataset) data set "World Cities, Countries & Languages."

First, load the three CSV files in the subfolder `./ccl/`. Then, with the help of the data sets, answer the following questions:

**Q1** Of the world's largest 100 cities (by population), how many are located in India?

**Q2** What is the most spoken official language in Tanzania? What percentage of the population speaks it, and how many people (in absolute numbers) speak it?

**Q3** What is the most spoken language in Tanzania (including *all* languages, not just the "official" ones)? What percentage of the population speaks it, and how many people (in absolute numbers) speak it?

**Q4** For each language, count in how many countries it is spoken. What are the 20 languages with the highest number of countries-they-are-spoken-in?

**Q5** Of all the languages spoken in only one country, which two are spoken by 100% of the population according to the data set?

In [2]:
# YOUR CODE HERE

Reading in the 3 data sets

In [3]:
city = pd.read_csv("./ccl/city.csv")
city.head()

Unnamed: 0,ID,Name,CountryCode,District,Population
0,1,Kabul,AFG,Kabol,1780000
1,2,Qandahar,AFG,Qandahar,237500
2,3,Herat,AFG,Herat,186800
3,4,Mazar-e-Sharif,AFG,Balkh,127800
4,5,Amsterdam,NLD,Noord-Holland,731200


In [4]:
country = pd.read_csv("./ccl/country.csv")
country.head()

Unnamed: 0,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2
0,ABW,Aruba,North America,Caribbean,193.0,,103000,78.4,828.0,793.0,Aruba,Nonmetropolitan Territory of The Netherlands,Beatrix,129.0,AW
1,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF
2,AGO,Angola,Africa,Central Africa,1246700.0,1975.0,12878000,38.3,6648.0,7984.0,Angola,Republic,José Eduardo dos Santos,56.0,AO
3,AIA,Anguilla,North America,Caribbean,96.0,,8000,76.1,63.2,,Anguilla,Dependent Territory of the UK,Elisabeth II,62.0,AI
4,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34.0,AL


In [5]:
language = pd.read_csv("./ccl/countrylanguage.csv")
language.head()

Unnamed: 0,CountryCode,Language,IsOfficial,Percentage
0,ABW,Dutch,T,5.3
1,ABW,English,F,9.5
2,ABW,Papiamento,F,76.7
3,ABW,Spanish,F,7.4
4,AFG,Balochi,F,0.9


**Q1** Top 100 populated cities that are located in India:

In [6]:
# country code of India?
country.loc[(country["Name"]=="India"), "Code"]

99    IND
Name: Code, dtype: object

In [7]:
# to extract only the string value of that cell: .values[]
country.loc[(country["Name"]=="India"), "Code"].values[0]

'IND'
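
As a side note, `.values[0]` is only one of several ways to pull a single value out of a filtered column. The sketch below uses a small made-up data frame (`toy_country` is an assumption for illustration, not part of the data set) to compare three options; which one to prefer is largely a matter of taste.

```python
import pandas as pd

# toy stand-in for the country data frame (illustrative subset only)
toy_country = pd.DataFrame({"Code": ["IND", "TZA"], "Name": ["India", "Tanzania"]})

# boolean mask + column label returns a Series with one element
match = toy_country.loc[toy_country["Name"] == "India", "Code"]

# three ways to extract the single string value from that Series
code_a = match.values[0]  # first element of the underlying array
code_b = match.iloc[0]    # first element by position
code_c = match.item()     # raises ValueError unless exactly one row matched

print(code_a, code_b, code_c)
```

`.item()` is the strictest of the three: it fails if the filter matches zero rows or more than one row, which can help catch a misspelled country name early.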

In [8]:
# store the country code of India in a variable for later use
india_code = country.loc[(country["Name"]=="India"), "Code"].values[0]

In [9]:
# sort a copy of the city df by descending population
city_sorted = city.copy()
city_sorted = city_sorted.sort_values(by="Population", ascending=False).reset_index(drop=True)
# keep only the first 100 rows
city_sorted = city_sorted[0:100].copy()
# how many of those have the country code IND?
print("Cities in India:", len(city_sorted[city_sorted["CountryCode"]==india_code]))
city_sorted[city_sorted["CountryCode"]==india_code]

Cities in India: 8


Unnamed: 0,ID,Name,CountryCode,District,Population
0,1024,Mumbai (Bombay),IND,Maharashtra,10500000
13,1025,Delhi,IND,Delhi,7206704
26,1026,Calcutta [Kolkata],IND,West Bengali,4399819
34,1027,Chennai (Madras),IND,Tamil Nadu,3841396
47,1028,Hyderabad,IND,Andhra Pradesh,2964638
51,1029,Ahmedabad,IND,Gujarat,2876710
58,1030,Bangalore,IND,Karnataka,2660088
99,1031,Kanpur,IND,Uttar Pradesh,1874409
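
The sort-then-slice approach above can probably be shortened with `DataFrame.nlargest`. The following is a minimal sketch on made-up data (`toy_city` and its values are assumptions for illustration), keeping the top 3 rows instead of the top 100.

```python
import pandas as pd

# toy stand-in for the city data frame (made-up names and populations)
toy_city = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "CountryCode": ["IND", "NLD", "IND", "AFG", "IND"],
    "Population": [500, 400, 300, 200, 100],
})

# keep the 3 rows with the largest population (use 100 for the real task)
top3 = toy_city.nlargest(3, "Population")

# a boolean Series sums to the number of True values
n_india = int((top3["CountryCode"] == "IND").sum())
print("Cities in India:", n_india)
```

Summing the boolean mask avoids building a filtered copy of the data frame just to count its rows.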


**Q2** What is the most spoken official language in Tanzania? What percentage of the population speaks it, and how many people (in absolute numbers) speak it?

In [10]:
# find the country code of Tanzania (like for India in Q1)
tanzania_code = country.loc[(country["Name"]=="Tanzania"), "Code"].values[0]
tanzania_code

'TZA'

In [11]:
# languages, filtered by country == Tanzania and official == T:
language[(language["CountryCode"]==tanzania_code)&(language["IsOfficial"]=="T")]

Unnamed: 0,CountryCode,Language,IsOfficial,Percentage
890,TZA,Swahili,T,8.8


In [12]:
# in absolute numbers: multiply population (see "country" data frame) by percentage
tanzania_population = country.loc[country["Code"]==tanzania_code, "Population"].values[0]
swahili_percentage = language[(language["CountryCode"]==tanzania_code)&(language["IsOfficial"]=="T")]["Percentage"].values[0]
total_swahili_speakers = int(round(tanzania_population * swahili_percentage/100, -3)) # rounding to thousands
total_swahili_speakers
# approximately 3 million people!

2949000
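
The calculation above handles one language in one country. If absolute speaker counts were needed for every row at once, one option would be to join the two tables first. This is a hedged sketch on made-up data: the country codes `XXA`/`XXB`, the language names, and all numbers are invented for illustration; only the column names follow the data set.

```python
import pandas as pd

# toy stand-ins for the country and language data frames (made-up values)
toy_country = pd.DataFrame({"Code": ["XXA", "XXB"], "Population": [1000, 2000]})
toy_language = pd.DataFrame({
    "CountryCode": ["XXA", "XXA", "XXB"],
    "Language": ["Lang1", "Lang2", "Lang1"],
    "Percentage": [60.0, 40.0, 10.0],
})

# attach each country's population to its language rows
merged = toy_language.merge(
    toy_country, left_on="CountryCode", right_on="Code", how="left"
)

# percentage of population -> absolute number of speakers
merged["Speakers"] = (merged["Population"] * merged["Percentage"] / 100).round().astype(int)
print(merged[["CountryCode", "Language", "Speakers"]])
```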

**Q3** What is the most spoken language in Tanzania (including *all* languages, not just the "official" ones)? What percentage of the population speaks it, and how many people (in absolute numbers) speak it?

In [13]:
# languages, filtered by country == Tanzania:
languages_tza = language[(language["CountryCode"]==tanzania_code)].reset_index(drop=True).copy()
languages_tza = languages_tza.sort_values(by="Percentage", ascending=False).reset_index(drop=True)
languages_tza

Unnamed: 0,CountryCode,Language,IsOfficial,Percentage
0,TZA,Nyamwesi,F,21.1
1,TZA,Swahili,T,8.8
2,TZA,Hehet,F,6.9
3,TZA,Makonde,F,5.9
4,TZA,Haya,F,5.9
5,TZA,Nyakusa,F,5.4
6,TZA,Chaga and Pare,F,4.9
7,TZA,Luguru,F,4.9
8,TZA,Shambala,F,4.3
9,TZA,Gogo,F,3.9


In [14]:
languages_tza.loc[0,"Language"] # most spoken language

'Nyamwesi'

In [15]:
# number of people speaking it:
int(round(tanzania_population * languages_tza.loc[0,"Percentage"]/100 , -3))
# approx 7 million people!

7072000
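
When only the top row is of interest, sorting the whole table is not strictly necessary. A possible alternative, sketched here on made-up data (`toy_lang` and its language names are assumptions for illustration), is to look up the row label of the maximum with `idxmax`.

```python
import pandas as pd

# toy stand-in for the language rows of one country (made-up names)
toy_lang = pd.DataFrame({
    "Language": ["Lang1", "Lang2", "Lang3"],
    "Percentage": [8.8, 21.1, 6.9],
})

# idxmax returns the index label of the largest value in the column
top_label = toy_lang["Percentage"].idxmax()

# use that label to fetch the whole row
top_row = toy_lang.loc[top_label]
print(top_row["Language"], top_row["Percentage"])
```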

**Q4** For each language, count in how many countries it is spoken. What are the 20 languages with the highest number of countries-they-are-spoken-in?



In [16]:
languages_list = []
country_counts = []
for lang, group in language.groupby("Language"):
    languages_list.append(lang)
    country_counts.append(len(group["CountryCode"].unique()))
counts_df = pd.DataFrame(
    {
        "language":languages_list,
        "country_count":country_counts
    }
)
counts_df = counts_df.sort_values(by="country_count", ascending=False).reset_index(drop=True)
counts_df[0:20]

Unnamed: 0,language,country_count
0,English,60
1,Arabic,33
2,Spanish,28
3,French,25
4,German,19
5,Chinese,19
6,Russian,17
7,Italian,15
8,Creole English,14
9,Portuguese,12
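
The explicit loop over groups above works, but the same count can likely be obtained in one chained expression with `groupby(...).nunique()`. The sketch below uses made-up data (`toy_language`, the codes `XXA` to `XXC`, and the language names are assumptions for illustration).

```python
import pandas as pd

# toy stand-in for the language data frame (made-up codes and names)
toy_language = pd.DataFrame({
    "CountryCode": ["XXA", "XXA", "XXB", "XXC", "XXC"],
    "Language": ["Lang1", "Lang2", "Lang1", "Lang1", "Lang3"],
})

# number of distinct countries per language, largest first
counts = (
    toy_language.groupby("Language")["CountryCode"]
    .nunique()
    .sort_values(ascending=False)
)
print(counts.head(20))

# languages that appear in exactly one country
single_country = sorted(counts[counts == 1].index)
print(single_country)
```

The result is a Series indexed by language, so `counts.head(20)` plays the role of `counts_df[0:20]`.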


**Q5** Of all the languages spoken in only one country, which two are spoken by 100% of the population according to the data set?

In [17]:
languages_in_one_country = list(counts_df[counts_df["country_count"]==1]["language"])

In [18]:
language[(language["Language"].isin(languages_in_one_country)) & (language["IsOfficial"]=="T") & (language["Percentage"]==100)]

Unnamed: 0,CountryCode,Language,IsOfficial,Percentage
296,FRO,Faroese,T,100.0
543,MDV,Dhivehi,T,100.0


# Task 2: `binary_search`

Write a function `binary_search` that:
* takes two input arguments: an **already sorted** list of numbers; and a **number**
* returns `True` or `False`, depending on whether the number is on the list or not
* implements a **binary search algorithm** ("cutting the search space in half at each step") to find out whether the number is on the list
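
To get a feeling for why halving the search space is so effective, the following sketch counts how many times a range of a given size can be halved before nothing is left. The helper `count_halving_steps` is a hypothetical name introduced here for illustration; it does not search anything, it only models the worst-case number of steps.

```python
def count_halving_steps(n_items):
    """Count how often a search space of n_items can be halved until it is empty."""
    steps = 0
    while n_items > 0:
        n_items //= 2  # after a miss, at most half of the items remain
        steps += 1
    return steps

for size in [1, 1000, 1_000_000]:
    print(size, "items ->", count_halving_steps(size), "steps at most")
```

Even for a million items, about 20 steps are enough, whereas checking the items one by one could take up to a million comparisons.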

If you need some inspiration on how to approach this task, check out **Chapter 8 - Friday: Writing a Binary Search** from the CM book [**Python Projects for Beginners**](https://github.com/anastassiavybornova/learn-python-2025/blob/main/books/CM.pdf).

You can use the `./numbers/numbers.csv` file (provided with this notebook) to test your function:
* `binary_search(my_list, 4403)` should return `True` (4403 is on the list)
* `binary_search(my_list, 52301)` should return `False` (52301 is not on the list)

Here, `my_list` is the list of numbers read in from the file.

In [19]:
# YOUR CODE HERE

In [20]:
def binary_search(my_sorted_list, my_item):
    '''
    function that uses the binary search algorithm
    to find whether my_item is in my_sorted_list;
    returns True if it is, and False otherwise
    '''

    # define the "search limits"
    left_limit = 0
    right_limit = len(my_sorted_list) - 1 # this is the index of the last element on the list

    # while the search space still contains at least 1 element:
    while left_limit <= right_limit:

        # find the "middle index" (// is integer division, which rounds down to a whole number)
        middle_index = (right_limit + left_limit) // 2
        
        # if the item at middle_index position is the one we're looking for,
        # we're done! we found the item on the list - so return True
        if my_sorted_list[middle_index] == my_item:
            return True
        
        # if at the "middle" position we have something SMALLER than my_item,
        # my_item must be in the right half of the search space,
        # so we need to modify the left limit:
        elif my_sorted_list[middle_index] < my_item:
            left_limit = middle_index + 1

        # if at the "middle" position we have something BIGGER than my_item,
        # then my_item must be in the left half of the search space,
        # so we need to modify the right limit:
        else:
            right_limit = middle_index - 1

    # the while loop ends when left_limit becomes greater than right_limit,
    # which means that we have searched the entire space, but found no match;
    # in that case our number is not on the list,
    # so return False:
    return False
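
One way to cross-check a hand-written binary search is to compare it against Python's standard library, which offers the `bisect` module for sorted lists. The sketch below is not part of the exercise solution; `contains_sorted` and `sample` are hypothetical names introduced here for illustration.

```python
from bisect import bisect_left

def contains_sorted(sorted_list, item):
    """Membership test on a sorted list, based on the standard library's bisect_left."""
    # bisect_left returns the leftmost position where item could be inserted
    # while keeping the list sorted
    pos = bisect_left(sorted_list, item)
    # item is on the list only if that position exists and already holds item
    return pos < len(sorted_list) and sorted_list[pos] == item

sample = [1, 3, 5, 7, 9, 11]
print(contains_sorted(sample, 7))  # 7 is on the list
print(contains_sorted(sample, 8))  # 8 is not on the list
```

Running both functions on the same inputs, including an empty list and values below the smallest or above the largest element, is a quick way to catch off-by-one mistakes in the search limits.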

**Read in the data from `numbers.csv` to a list**

In [21]:
# import pandas
import pandas as pd
# read in numbers.csv into dataframe called "df"; 
# header = None means that in the csv, the first line is not a header row, but already contains data
df = pd.read_csv("./numbers/numbers.csv", header = None)
# the numbers are in a column called "0" by default (because no header was provided);
# use list() to convert the column into a list
my_list = list(df[0])

In [22]:
# check if your function works as expected
assert binary_search(my_list, 4403)
assert not binary_search(my_list, 52301)