Week 6 of the Leep Talent Data Technician Skills Bootcamp (Level 3)
This section showcases the Python skills I developed during the sixth week of my bootcamp, covering core Python syntax, data structures, control flow, pandas data analysis, data cleaning, outlier detection, and data visualisation — all in Google Colab using real-world datasets.
| Task | Topic | Skills |
|---|---|---|
| Setup & Environment | Google Colab, libraries, mounting Drive | pip, import, pd.read_csv, Drive mount |
| Core Syntax — Shortcuts & Comparisons | Operators and boolean logic | +=, -=, *=, //=, %=, ==, !=, <, > |
| Data Structures | Lists, tuples, sets, dictionaries | indexing, slicing, .append(), .get(), .keys() |
| Control Flow — if / elif / else | Conditional logic | Customer tiers, time-of-day greetings, score graders |
| Day 2 Task 1 — FizzBuzz & Loops | Loops and modulo logic | for, while, range(), % operator |
| Day 3 Task 1 — Student Analysis (pandas) | pandas end-to-end analysis | groupby, pivot_table, apply, sort_values |
| Day 4 Task 1 — GDP Data Cleaning | Real-world data preparation | dropna, str.strip, pd.to_datetime, type conversion |
| Day 4 Task 2 — Visualisation | Charts and correlation analysis | Matplotlib, Seaborn, scatter, histogram, heatmap, IQR outliers |
Tool: Google Colab (Python 3)
Libraries: pandas, NumPy, Matplotlib, Seaborn
All Python work this week was completed in Google Colab — a cloud-based Jupyter notebook environment that runs in the browser without any local installation. Each new session requires re-running the setup cell and remounting Google Drive.
In my own words: Google Colab removes the friction of environment setup so you can focus on the analysis. The key habit to develop is understanding what needs re-running at the start of each session versus what only needs to run once — getting that wrong means Python can't find your files or libraries mid-notebook.
```python
# Install libraries (run once per session)
!pip install pandas matplotlib seaborn --quiet

# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display options
pd.set_option("display.max_rows", 10)
sns.set(style="whitegrid")
print("Setup complete")
```

```python
# Mount Google Drive to access files stored there
from google.colab import drive
drive.mount('/content/drive')
```

```python
# Upload directly from your device
from google.colab import files
uploaded = files.upload()

# Load the uploaded file into a DataFrame
df = pd.read_csv('/content/student.csv')
df.head()
```

📸 Open Google Colab. Run the setup cell above. Capture the output showing `Setup complete` at the bottom of the cell and the library installation progress. The cell run indicator (green tick/circle) should be visible on the left. Save as `images/setup_colab_complete.png`.
Understanding the Colab environment — what persists, what resets, and how to efficiently re-initialise a session — is practical knowledge that applies equally to any remote or cloud-based Python environment, including Azure Notebooks, AWS SageMaker, and JupyterHub deployments used by real organisations.
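To make that re-initialisation habit concrete, here is a minimal sketch of my own (assuming Colab's default `/content/drive` mount point) that only remounts Drive when it isn't already available:

```python
import os

MOUNT_POINT = "/content/drive"  # Colab's default mount point (assumption)

# MyDrive only appears under the mount point once Drive is mounted
if not os.path.isdir(os.path.join(MOUNT_POINT, "MyDrive")):
    from google.colab import drive
    drive.mount(MOUNT_POINT)
else:
    print("Drive already mounted, skipping remount.")
```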
Tool: Google Colab
Reference: shortcuts_and_comparisions.docx
Practised Python's shortcut assignment operators and comparison operators — the building blocks of any script or data pipeline. Each operator was run and verified against expected output.
In my own words: Shortcut operators like `+=` and `%=` make code shorter and more readable. They appear constantly in loops, accumulators, and data processing scripts, so being confident with them early means the logic of more complex code is easier to follow.
```python
# Addition shortcut
points = 10
points += 5
print(points)  # 15

# Subtraction shortcut
score = 20
score -= 5
print(score)  # 15

# Multiplication shortcut
price = 50
price *= 3
print(price)  # 150

# Division shortcut (returns float)
total = 100
total /= 4
print(total)  # 25.0

# Integer division (floors the result)
boxes = 53
boxes //= 10
print(boxes)  # 5

# Modulo (remainder)
rem = 27
rem %= 10
print(rem)  # 7
```

```python
# Equality and inequality
number = 60
print(number == 60)  # True
print(number != 60)  # False

# Greater / less than
mark = 85
print(mark > 90)  # False
print(mark < 90)  # True

# Compound comparisons
age = 18
print(age >= 18)  # True
print(age <= 18)  # True
```

```python
x = 10
x += 3
print(x)  # 13

price = 50
price *= 2
price /= 5
print(price)  # 20.0

a = 7
b = 10
print(a < b)  # True

age = 18
print(age >= 18)  # True
print(age != 21)  # True
```

📸 In Google Colab, run the shortcut operators code block. Capture the output showing the results (15, 15, 150, 25.0, 5, 7) clearly visible below the cell. The cell input (code) and output should both be visible in the same capture. Save as `images/syntax_shortcuts_output.png`.
Tool: Google Colab
Reference: List__tuple_and_Dictionary_Tasks.docx
Worked through Python's core data structure types — lists, tuples, sets, and dictionaries — practising creation, indexing, slicing, modification, and retrieval for each.
In my own words: Data structures are how Python organises information in memory. Lists are the most flexible because they're ordered and mutable. Tuples are for data that shouldn't change. Sets automatically remove duplicates. Dictionaries are the most powerful for labelled data — essentially a lookup table, which is how a lot of real-world data processing works.
```python
# Create and type-check
mylist = ["apple", "banana", "cherry"]
print(type(mylist))  # <class 'list'>

# Mixed-type list (real-world customer record)
customer_info = ["Mr.X", 22, "UK", "12/05/2001", 1.65]
print(customer_info)

# Negative indexing — get elements [2, 3, 4, 5]
mylist2 = [1, 2, 3, 4, 5, 6, 7]
print(mylist2[-6:-2])  # [2, 3, 4, 5]

# Slicing
nums = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(nums[2:7])  # [30, 40, 50, 60, 70]

# Extend, append, insert
thislist = ["apple", "banana", "cherry"]
newlist = ["mango", "orange", "plum", "pineapple", "peach"]
thislist.extend(newlist)
thislist.append("orange")
thislist.insert(2, "watermelon")

# Delete a slice
del thislist[2:5]
print(thislist)
```

```python
# Create tuple and access by index
mytuple = ("apple", "banana", "cherry")
print(mytuple[1])   # banana
print(mytuple[-1])  # cherry (negative index)

# Concatenation and repetition
my_tuple = (1, 10, 100)
t1 = my_tuple + (1000, 10000)
t2 = my_tuple * 3
print(t2)       # (1, 10, 100, 1, 10, 100, 1, 10, 100)
print(len(t2))  # 9

# Membership test
print(10 in my_tuple)       # True
print(-10 not in my_tuple)  # True
```

```python
# Sets are unordered and remove duplicates
mySet2 = {"apple", "banana", "cherry", "apple", "banana"}
print(mySet2)  # {'cherry', 'banana', 'apple'}
```

```python
# Create a dictionary
mydict = {"brand": "Ford", "model": "Mustang", "year": 1964}
print(mydict.get("brand"))  # Ford

# Keys and values
dictionary = {"cat": "chat", "dog": "chicken", "horse": "cheval"}
print(dictionary.keys())    # dict_keys(['cat', 'dog', 'horse'])
print(dictionary.values())  # dict_values(['chat', 'chicken', 'cheval'])
```

Screenshot 1 — List operations (slicing and extend)
📸 In Google Colab, run the list slicing and `.extend()` code. Capture the cell with both the code and the output showing the list before and after modification. Save as `images/ds_list_operations.png`.

Screenshot 2 — Dictionary keys and values
📸 Run the dictionary code block. Capture the cell showing the `.keys()` and `.values()` outputs clearly. Save as `images/ds_dict_keys_values.png`.
A pandas DataFrame maps directly onto Python dictionaries: it can be constructed from a dict of column lists, and column access works like a dictionary lookup. Understanding how dictionaries, lists, and indexing work in plain Python therefore makes debugging and extending pandas operations far more intuitive — and is a prerequisite for writing efficient data pipeline code.
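To make that relationship concrete, a minimal sketch using hypothetical sample data, showing a dict of lists becoming a DataFrame and back:

```python
import pandas as pd

# Each dictionary key becomes a column name;
# each list becomes that column's values
records = {
    "name": ["Amara", "Ben", "Chloe"],  # hypothetical sample rows
    "mark": [78, 64, 91],
}
df_demo = pd.DataFrame(records)
print(df_demo)
print(df_demo.to_dict(orient="list"))  # and back to a dict again
```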
Tool: Google Colab
Reference: if_else__elif.xlsx
Practised writing conditional logic using if, elif, and else — including nested conditions and compound conditions with and. Applied to real business scenarios such as customer tiering, time-based greetings, and score grading.
In my own words: `if`/`elif`/`else` is how Python makes decisions. In data analysis, it's used constantly — for categorising records, flagging anomalies, assigning grades, and routing logic. Understanding the order of conditions matters: Python evaluates from top to bottom and stops at the first match, so the most specific condition always comes first.
```python
annualSales = 300000
region = "North"

if annualSales >= 500000:
    print("Gold Customer")
elif annualSales >= 300000:
    print("Silver Customer")
    # Nested if — adds region-specific logic
    if region == "North":
        print("Priority support for North region")
elif annualSales >= 100000:
    print("Bronze Customer")

print("Thank you for your business")

# Output:
# Silver Customer
# Priority support for North region
# Thank you for your business
```

```python
number1 = int(input("Enter the first number: "))
number2 = int(input("Enter the second number: "))

if number1 > number2:
    print("The larger number is:", number1)
elif number2 > number1:
    print("The larger number is:", number2)
else:
    print(number1, "is equal to", number2)
```

```python
num = float(input("Enter a number: "))

if num > 0:
    print("The number is positive.")
elif num < 0:
    print("The number is negative.")
else:
    print("The number is zero.")
```

```python
# Score grader
user_number = int(input("Provide a number: "))
if user_number >= 100:
    print("Excellent")
elif user_number >= 75:
    print("Good")
elif user_number >= 50:
    print("Average")
else:
    print("Below Average")

# Time-based greeting
hour = 14
if hour < 12:
    print("Good Morning")
elif hour < 18:
    print("Good Afternoon")  # Output: Good Afternoon
elif hour < 21:
    print("Good Evening")
else:
    print("Good Night")

# Membership discount
status = input("Enter your membership level (Gold, Silver, Bronze): ")
if status == "Gold":
    print("Discount: 20%")
elif status == "Silver":
    print("Discount: 10%")
elif status == "Bronze":
    print("Discount: 5%")
else:
    print("No discount available.")
```

📸 In Google Colab, run the customer tier system code with `annualSales = 300000` and `region = "North"`. Capture the cell showing the three lines of output: "Silver Customer", "Priority support for North region", and "Thank you for your business". Save as `images/cf_customer_tier.png`.
Conditional logic is the backbone of data categorisation — assigning grades, flagging outliers, routing records to different processing paths. This task demonstrates I can write multi-branch logic correctly, including nested conditions, which is directly relevant to the assign_grade() function used in the pandas exercises later in the week.
Tool: Google Colab
Reference: while_for_loops_table_and_Fbuzz.xlsx, workbook Day 2
FizzBuzz is a classic programming interview challenge. The task also required writing while and for loops for even numbers, multiples of 7, and odd number sequences — demonstrating understanding of iteration, range(), step arguments, and the modulo operator.
In my own words: FizzBuzz tests whether you understand conditional ordering — the divisible-by-15 check must come before the individual 3 and 5 checks because Python stops at the first True condition. Getting that order wrong produces incorrect output silently, which is a good lesson in why the sequence of conditions matters in any real-world classification logic.
Organisation type: Any tech team conducting coding interviews
FizzBuzz is used in software engineering interviews precisely because it requires combining loops, conditionals, and the modulo operator correctly — testing logical thinking under pressure rather than just syntax knowledge.
```python
# range(1, 101) — starts at 1, ends exactly at 100
# FizzBuzz check MUST come first to avoid being skipped by the
# individual Fizz or Buzz conditions
for n in range(1, 101):
    if n % 3 == 0 and n % 5 == 0:
        print("FizzBuzz")
    elif n % 3 == 0:
        print("Fizz")
    elif n % 5 == 0:
        print("Buzz")
    else:
        print(n)
```

```python
# Even numbers 0–100 using while
my_number = 0
while my_number <= 100:
    print(my_number)
    my_number += 2

# Multiples of 7 using while + if
my_number = 0
while my_number <= 100:
    if my_number % 7 == 0:
        print(my_number)
    my_number += 1
```

```python
# Numbers 1 to 50
for n in range(1, 51):
    print(n, end=" ")

# Even numbers 0–100 (step argument)
for n in range(0, 101, 2):
    print(n, end=" ")

# Multiples of 7 (step = 7)
for n in range(0, 101, 7):
    print(n, end=" ")

# Odd numbers 1–100
for n in range(1, 101, 2):
    print(n, end=" ")
```

Screenshot 1 — FizzBuzz output (first 20 numbers)
📸 In Google Colab, run the FizzBuzz code. Capture the output scrolled to show the first 20 numbers — you should clearly see 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz, 16... demonstrating all three conditions working correctly. Save as `images/loop_fizzbuzz.png`.

Screenshot 2 — Even numbers 0–100 (while loop)
📸 Run the even-numbers while loop. Capture the output showing even numbers from 0 to 100. The final value visible should be 100. Save as `images/loop_while_even.png`.

Screenshot 3 — For loop with step argument (multiples of 7)
📸 Run the multiples-of-7 for loop using `range(0, 101, 7)`. Capture the output showing 0, 7, 14, 21... 98. Save as `images/loop_for_range_step.png`.
- Finding 1: Ordering the `FizzBuzz` check first is essential — if `n % 3 == 0` came first, every multiple of 15 would print "Fizz" instead of "FizzBuzz". This mirrors real-world classification logic where overlapping categories must be resolved by specificity (see the sketch after this list).
- Finding 2: The `step` argument in `range()` is more efficient than using a `while` loop with manual increment — it communicates intent directly in the function call and reduces the chance of accidental infinite loops.
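To illustrate Finding 1, a deliberately broken sketch — `classify_wrong` is a hypothetical helper of my own, not part of the original task:

```python
# Deliberately wrong ordering: the specific FizzBuzz condition is
# placed last, so it can never be reached.
def classify_wrong(n):
    if n % 3 == 0:
        return "Fizz"              # 15 stops here; this is the bug
    elif n % 5 == 0:
        return "Buzz"
    elif n % 3 == 0 and n % 5 == 0:
        return "FizzBuzz"          # unreachable branch
    else:
        return str(n)

print(classify_wrong(15))  # prints "Fizz", not "FizzBuzz"
```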
FizzBuzz demonstrates that I can combine loops, conditionals, and the modulo operator correctly — a combination that appears in data processing tasks such as batching records, flagging periodic events, and applying rules in sequence.
Dataset: student.csv
File: student.csv
Tool: Google Colab, pandas
A classroom dataset containing student names, class groups, gender, and marks. Used for a structured group exercise covering the full pandas workflow from loading and exploring through to aggregation, advanced operations, and export.
In my own words: This exercise walked through every pandas operation a junior analyst would need to know on day one. The grade assignment function using `apply()` was the most interesting part — it shows how Python functions and pandas work together to create entirely new analytical columns from existing data.
Organisation type: School, training provider, or any organisation tracking performance metrics
A teacher or education data analyst would use these exact techniques to produce class performance reports, identify at-risk students, and compare performance across demographic groups — all operations that would take much longer in a spreadsheet.
```python
import pandas as pd

# Load CSV
df = pd.read_csv('student.csv')

# First 5 rows
df.head()

# Column info and data types
df.info()

# Summary statistics
df.describe()
```

```python
# Single column
df['name']

# Multiple columns
df[['name', 'mark']]

# First 3 rows
df.head(3)

# Filter rows where class is 'Four'
df[df['class'] == 'Four']
```

```python
# Add a 'passed' column (True/False)
df['passed'] = df['mark'] >= 60

# Rename column
df = df.rename(columns={'mark': 'score'})

# Drop a column
df = df.drop(columns=['passed'])
```

```python
# Mean score per class
df.groupby('class')['score'].mean()

# Count students per class
df.groupby('class')['name'].count()

# Average score by gender
df.groupby('gender')['score'].mean()
```

```python
# Pivot table: class vs gender, values = score
df.pivot_table(index='class', columns='gender', values='score')

# Assign grades using a custom function
def assign_grade(score):
    if score >= 85:
        return 'A'
    elif score >= 70:
        return 'B'
    elif score >= 60:
        return 'C'
    else:
        return 'D'

df['grade'] = df['score'].apply(assign_grade)

# Sort by score descending
df.sort_values(by='score', ascending=False)
```

```python
# Save with new 'grade' column to CSV
df.to_csv('student_grades.csv', index=False)
```

Screenshot 1 — df.head() and df.info()
📸 In Google Colab, run the load and explore cells. Capture the output showing `df.head()` (first 5 rows with all columns visible) and below it `df.info()` showing the column names, data types, and non-null counts. Save as `images/pandas_student_load.png`.

Screenshot 2 — groupby mean score per class
📸 Run `df.groupby('class')['score'].mean()`. Capture the output showing each class name on the left and its average score on the right. Save as `images/pandas_student_groupby.png`.

Screenshot 3 — Pivot table (class × gender)
📸 Run the pivot table code. Capture the output table showing class names in rows, gender columns (Female/Male or similar), and average score values in each cell. Save as `images/pandas_student_pivot.png`.

Screenshot 4 — Grade column applied + sorted descending
📸 Run `df['grade'] = df['score'].apply(assign_grade)` followed by `df.sort_values(by='score', ascending=False)`. Capture the resulting DataFrame with the `grade` column visible, sorted so the highest scores and 'A' grades appear at the top. Save as `images/pandas_student_grades.png`.
- Finding 1: `groupby()` reduced a complex manual calculation to a single line — grouping by class and averaging scores would take minutes in a spreadsheet for a large dataset but runs instantly in pandas regardless of size.
- Finding 2: Using `apply()` with a custom function to assign grades is far more readable and maintainable than nested `np.where()` calls — the function logic is clearly visible and easy to modify if the grading boundaries change (a sketch of the alternative follows this list).
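For comparison, a sketch of the nested `np.where()` alternative mentioned in Finding 2, assuming the same `df` with its renamed `score` column:

```python
import numpy as np

# Nested np.where() equivalent of assign_grade: it works, but the
# grading boundaries are buried inside three levels of nesting
df['grade'] = np.where(df['score'] >= 85, 'A',
              np.where(df['score'] >= 70, 'B',
              np.where(df['score'] >= 60, 'C', 'D')))
```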
This exercise covers the complete pandas workflow that appears in most junior analyst technical tests: load, explore, filter, transform, aggregate, and export. The ability to write a groupby aggregation and a custom apply() function demonstrates intermediate pandas competence, not just beginner-level data loading.
Dataset: GDP (nominal) per Capita.csv
Files: GDP__nominal__per_Capita.csv, GDP_nominal_per_capita_clean.csv
Notebook: Week4.ipynb
Tool: Google Colab, pandas
A global GDP per capita dataset covering approximately 200 countries and territories, with estimates from three sources — IMF, World Bank, and UN — alongside the year each estimate was recorded. Columns: Country/Territory, UN_Region, IMF_Estimate, IMF_Year, WorldBank_Estimate, WorldBank_Year, UN_Estimate, UN_Year.
In my own words: Real datasets are never clean. This one had missing values, inconsistent whitespace in country names, mixed data types, and numeric columns stored as strings. Working through each of these systematically — checking first, then fixing, then verifying — is how a professional analyst prepares data before any analysis or visualisation takes place.
Organisation type: International research organisation, government economics team, financial institution
Any analysis of global economic data needs to account for missing estimates (not all countries report to all three agencies), inconsistent formatting from different data sources, and the difference between a year stored as an integer versus a proper datetime object — which affects how time-based filtering and comparison work downstream.
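As a minimal sketch of that downstream difference (assuming `IMF_Year` has already been converted with `pd.to_datetime` as in the cleaning code below), a datetime column exposes calendar components through the `.dt` accessor:

```python
# With a proper datetime64 column, the .dt accessor gives direct
# access to calendar components for filtering and comparison
recent = df[df['IMF_Year'].dt.year >= 2020]  # estimates from 2020 onward
print(recent.shape)
```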
```python
import pandas as pd

df = pd.read_csv("GDP (nominal) per Capita.csv")

df.head(10)  # First 10 rows
df.tail(5)   # Last 5 rows

# View two key columns
df[['Country/Territory', 'UN_Region']]

# Column info
df.info()

# Summary statistics
df.describe()

# Value counts per region
df['UN_Region'].value_counts()
```

```python
# Total missing values per column
df.isnull().sum()

# Total across the whole DataFrame
df.isnull().sum().sum()

# Rows where all values are missing
rows_all_na = df[df.isna().all(axis=1)]

# Rows where any value is missing
rows_with_na = df[df.isna().any(axis=1)]

# Create a cleaned version removing all rows with any missing value
gdp_clean = df.dropna()

# Verify no missing values remain
gdp_clean.isna().values.any()  # False

# Double-check per column
gdp_clean.isnull().sum()
```

```python
print("Initial Data Types:")
print(df.dtypes)

# a) Strip whitespace from country names
df['Country/Territory'] = df['Country/Territory'].str.strip()

# b) Convert IMF_Year to datetime
df['IMF_Year'] = pd.to_datetime(
    df['IMF_Year'],
    format='%Y',
    errors='coerce'  # invalid entries become NaT (Not a Time)
)
print(df['IMF_Year'].head())
print(df['IMF_Year'].dtype)  # datetime64[ns]

# c) Convert WorldBank_Estimate to integer
# (coerce non-numeric to NaN, fill with 0, then cast to int)
df['WorldBank_Estimate'] = (
    pd.to_numeric(df['WorldBank_Estimate'], errors='coerce')
    .fillna(0)
    .astype(int)
)
print(df['WorldBank_Estimate'].dtype)  # int64
print(df.dtypes)
```

What is NaT?
NaT stands for Not a Time — the datetime equivalent of NaN. It appears when a date value is missing or could not be converted, and behaves like a null in all datetime operations.
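A minimal sketch of that behaviour, using a throwaway Series rather than the GDP data:

```python
import pandas as pd

# "2023" parses cleanly; "unknown" is coerced to NaT instead of
# raising an error
s = pd.to_datetime(pd.Series(["2023", "unknown"]),
                   format="%Y", errors="coerce")
print(s)         # second value shows as NaT
print(s.isna())  # NaT is caught by isna(), exactly like NaN
```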
```python
col = 'IMF_Estimate'

# Calculate quartiles and IQR
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Filter to outlier rows only
outliers = df[(df[col] < lower) | (df[col] > upper)]
print(outliers.shape)  # How many outliers found
outliers.head()        # First few outlier rows (high-GDP countries)
```

```python
# Create cleaned copy with all transformations applied
gdp_clean = df.dropna().copy()
gdp_clean['Country/Territory'] = gdp_clean['Country/Territory'].str.strip()
gdp_clean['IMF_Year'] = pd.to_datetime(
    gdp_clean['IMF_Year'], format='%Y', errors='coerce'
)

# Save to CSV
gdp_clean.to_csv('GDP_nominal_per_capita_clean.csv', index=False)

# Download to device
from google.colab import files
files.download('GDP_nominal_per_capita_clean.csv')
```

Screenshot 1 — df.head(10) showing the GDP dataset
📸 In Google Colab, run `df.head(10)`. Capture the full table showing all eight columns (Country/Territory, UN_Region, IMF_Estimate, IMF_Year, WorldBank_Estimate, WorldBank_Year, UN_Estimate, UN_Year) and the first 10 rows. Monaco should appear as the first country (highest GDP). Save as `images/gdp_head10.png`.

Screenshot 2 — Missing values per column
📸 Run `df.isnull().sum()`. Capture the output showing each column name and its count of missing values. Columns with non-zero counts should be clearly visible. Save as `images/gdp_missing_values.png`.

Screenshot 3 — Data types before and after conversion
📸 Run the `print("Initial Data Types:")` block first, then after all three type conversions run `print(df.dtypes)` again. Capture both outputs side-by-side or in sequence to show how IMF_Year changed from object/int to datetime64 and WorldBank_Estimate changed to int64. Save as `images/gdp_dtypes_change.png`.

Screenshot 4 — IQR outlier detection result
📸 Run the IQR outlier block. Capture the output of `outliers.head()` showing the rows flagged as outliers — these should be high-GDP countries like Monaco, Liechtenstein, or Luxembourg. The Q1, Q3, IQR, lower, and upper bounds should be visible if printed, or the table alone is sufficient. Save as `images/gdp_outliers_iqr.png`.
- Finding 1: The `errors='coerce'` parameter in `pd.to_datetime()` and `pd.to_numeric()` is essential for defensive data cleaning — it converts unparseable values to `NaT` or `NaN` rather than crashing the notebook, which is the correct behaviour in production code.
- Finding 2: The IQR outlier detection flags high-GDP countries like Monaco and Luxembourg as statistical outliers. These aren't data errors — they are genuinely extreme values. A responsible analyst documents this distinction: outliers require investigation, not automatic removal (a "flag, don't drop" sketch follows this list).
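A minimal sketch of the "flag, don't drop" approach from Finding 2, reusing the `lower` and `upper` bounds from the IQR block above (`is_outlier` is a hypothetical column name of my own):

```python
# Add a boolean flag column instead of dropping the outlier rows,
# so extreme-but-genuine values stay available for review
df['is_outlier'] = (df['IMF_Estimate'] < lower) | (df['IMF_Estimate'] > upper)
print(df['is_outlier'].sum(), "rows flagged for investigation")
```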
This task demonstrates a complete, professional data cleaning workflow — the exact steps an analyst would follow before handing data to a reporting or modelling pipeline: check, clean, retype, validate, detect anomalies, and export. The use of .copy() to avoid modifying the original DataFrame is also good practice that distinguishes careful from careless code.
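To illustrate the `.copy()` point, a short sketch with `reviewed` as a hypothetical flag column:

```python
# Without .copy(), this filtered selection may be a view of df;
# assigning to it can raise SettingWithCopyWarning and behave
# unpredictably. With .copy(), the subset is fully independent.
europe = df[df['UN_Region'] == 'Europe'].copy()
europe['reviewed'] = True  # safe: the original df is untouched
```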
Dataset: GDP (nominal) per Capita.csv / GDP_nominal_per_capita_clean.csv
Notebook: W6D4_Task_for_workbook__1_.ipynb
Tool: Google Colab, Matplotlib, Seaborn
Extended the GDP analysis into visualisation — creating scatter plots, histograms, a correlation heatmap, and a top-10 bar chart. The task introduced both Matplotlib and Seaborn, demonstrating how each library's strengths suit different visualisation types.
In my own words: Visualisation is where analysis becomes communication. A table of IMF estimates is hard to read; a scatter plot showing the relationship between WorldBank and UN estimates makes the pattern immediately obvious. Choosing the right chart type for the data — and knowing which library to use — is as important as the underlying numbers.
Organisation type: International economics research team, central bank, global consultancy
Analysts at organisations like the IMF or World Bank present findings to policymakers who need visual summaries of global trends. A correlation heatmap immediately shows which estimates align closely between sources, and a bar chart of top-10 economies gives context to any single-country discussion.
```python
import matplotlib.pyplot as plt

# Full histogram of all numeric columns
df.hist(figsize=(15, 15))
plt.show()

# Focused histogram for IMF Estimate only
df.hist(column='IMF_Estimate', figsize=(6, 4))
plt.show()
```

```python
import numpy as np

plt.figure(figsize=(10, 6))

# Scatter points
plt.scatter(df['WorldBank_Estimate'], df['UN_Estimate'],
            color='blue', alpha=0.6)

# Trend line (linear regression)
# np.polyfit cannot handle NaN, so fit only on rows where both
# estimates are present
pair = df[['WorldBank_Estimate', 'UN_Estimate']].dropna()
m, b = np.polyfit(pair['WorldBank_Estimate'], pair['UN_Estimate'], 1)
x_line = range(int(df['WorldBank_Estimate'].min()),
               int(df['WorldBank_Estimate'].max()))
plt.plot(x_line, [m * x + b for x in x_line], color='red', linewidth=2)

plt.title('UN Estimates vs World Bank Estimates')
plt.xlabel('World Bank Estimate')
plt.ylabel('UN Estimate')
plt.grid(True)
plt.show()
```

```python
import seaborn as sns

# Select only numeric columns
numerical_features = df.select_dtypes(include=['number'])

# Calculate correlation matrix
corr = numerical_features.corr()

# Plot heatmap with annotations
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='turbo')
plt.title('Correlation Matrix — GDP Estimates')
plt.show()
```

```python
gdp_clean.sort_values('IMF_Estimate', ascending=False).head(10).plot(
    kind='bar',
    x='Country/Territory',
    y='IMF_Estimate',
    title='Top 10 Economies by IMF GDP per Capita Estimate',
    figsize=(12, 6),
    color='steelblue'
)
plt.ylabel('IMF GDP per Capita (USD)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```

Screenshot 1 — Histogram: IMF Estimate Distribution
📸 Run `df.hist(column='IMF_Estimate', figsize=(6, 4))`. Capture the histogram showing the distribution of GDP per capita values. The x-axis should show USD values and the y-axis should show frequency. The right-skew (most countries clustered at low values, with a long tail to the right) should be visible. Save as `images/viz_histogram_imf.png`.

Screenshot 2 — Scatter Plot: WorldBank vs UN Estimates
📸 Run the scatter plot code. Capture the chart showing blue dots (WorldBank_Estimate on x-axis, UN_Estimate on y-axis) with a red trend line. The positive correlation (dots trending upward left to right) should be visible. The axis labels and title should be readable. Save as `images/viz_scatter_wb_un.png`.

Screenshot 3 — Correlation Heatmap
📸 Run the Seaborn heatmap code. Capture the full heatmap showing the correlation matrix of all numeric columns, with colour coding and annotation values visible in each cell. The colour bar legend on the right should be visible. Save as `images/viz_heatmap_correlation.png`.

Screenshot 4 — Top 10 Economies Bar Chart
📸 Run the bar chart code. Capture the full chart showing 10 countries on the x-axis (rotated labels) and IMF GDP per capita on the y-axis. Monaco, Liechtenstein, or Luxembourg should appear as the tallest bar. The title and axis labels should be clearly readable. Save as `images/viz_top10_economies.png`.
- Finding 1: The correlation heatmap shows very high correlation between IMF, World Bank, and UN estimates (values close to 1.0), which confirms the three sources generally agree — but the scatter plot reveals individual countries where estimates diverge significantly, highlighting cases worth investigating.
- Finding 2: The histogram of IMF estimates is strongly right-skewed — the majority of countries have low GDP per capita while a small number of wealthy nations pull the mean far above the median. This is exactly the kind of distribution that makes mean-based comparisons misleading and motivates the use of median or IQR-based analysis (a quick numerical check is sketched below).
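A quick numerical check of that skew, run against the same `df`:

```python
# On a right-skewed column the mean sits well above the median
print(df['IMF_Estimate'].mean())    # pulled upward by wealthy outliers
print(df['IMF_Estimate'].median())  # robust to the long right tail
print(df['IMF_Estimate'].skew())    # positive value confirms right-skew
```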
This task demonstrates I can take cleaned data all the way through to visual output — choosing appropriate chart types for different analytical goals (distribution → histogram, relationship → scatter, correlation → heatmap, ranking → bar chart) and customising them to a professional standard. This full pipeline — load, clean, analyse, visualise — is the core deliverable of a junior data analyst role.
- Google Colab — cloud Python/Jupyter environment; Drive mounting, file upload/download
- pandas — `read_csv`, `head`, `tail`, `info`, `describe`, `isnull`, `dropna`, `groupby`, `pivot_table`, `apply`, `sort_values`, `rename`, `drop`, `to_csv`, `str.strip`, `pd.to_datetime`, `pd.to_numeric`, `quantile`
- NumPy — `np.polyfit` for trend line calculation
- Matplotlib — `plt.scatter`, `plt.plot`, `plt.hist`, `df.plot(kind='bar')`, `plt.grid`, `plt.title`, `plt.xlabel`, `plt.ylabel`
- Seaborn — `sns.heatmap`, correlation matrix visualisation
- Core Python — lists, tuples, sets, dictionaries, `if`/`elif`/`else`, `for` loops, `while` loops, `range()`, modulo `%`, shortcut operators, custom functions with `def`
| File | Description | Source |
|---|---|---|
| `student.csv` | Student names, class, gender, and marks | Bootcamp |
| `GDP__nominal__per_Capita.csv` | GDP per capita estimates (IMF, World Bank, UN) for ~200 countries | Bootcamp (Kaggle/Wikipedia) |
| `GDP_nominal_per_capita_clean.csv` | Cleaned version — NaN rows removed, types fixed | Generated in Task 1 |
| `Week4.ipynb` | Google Colab notebook — GDP cleaning and visualisation | Bootcamp Day 4 Task 1 |
| `W6D4_Task_for_workbook__1_.ipynb` | Google Colab notebook — Day 4 Task 2 visualisation | Bootcamp Day 4 Task 2 |



















