## **[2025 Data Science task](https://github.com/CEMA-training/internship_task_dscience)**

In [111]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline


#html export
import plotly.io as pio
pio.renderers.default = 'notebook'

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## **Dataset creation**

In [112]:
hiv_df = pd.read_csv(r'HIV data 2000-2023.csv', encoding='ISO-8859-1')
hiv_df.head()

| | IndicatorCode | Indicator | ValueType | ParentLocationCode | ParentLocation | Location type | SpatialDimValueCode | Location | Period type | Period | Value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HIV_0000000001 | Estimated number of people (all ages) living w... | numeric | AFR | Africa | Country | AGO | Angola | Year | 2023 | 320 000 [280 000 - 380 000] |
| 1 | HIV_0000000001 | Estimated number of people (all ages) living w... | numeric | AFR | Africa | Country | AGO | Angola | Year | 2022 | 320 000 [280 000 - 380 000] |
| 2 | HIV_0000000001 | Estimated number of people (all ages) living w... | numeric | AFR | Africa | Country | AGO | Angola | Year | 2021 | 320 000 [280 000 - 380 000] |
| 3 | HIV_0000000001 | Estimated number of people (all ages) living w... | numeric | AFR | Africa | Country | AGO | Angola | Year | 2020 | 320 000 [280 000 - 370 000] |
| 4 | HIV_0000000001 | Estimated number of people (all ages) living w... | numeric | AFR | Africa | Country | AGO | Angola | Year | 2015 | 300 000 [260 000 - 350 000] |


### Attributes

- **IndicatorCode**: A unique identifier for the indicator being measured (e.g., `"HIV_0000000001"` for the estimated number of people living with HIV).

- **Indicator**: A description of the indicator being measured (e.g., `"Estimated number of people (all ages) living with HIV"`).

- **ValueType**: Specifies the type of data recorded (e.g., `"numeric"` for numerical values).

- **ParentLocationCode**: A code representing the broader geographical region to which the location belongs (e.g., `"AFR"` for Africa).

- **ParentLocation**: The name of the broader geographical region (e.g., `"Africa"`).

- **Location type**: Describes the type of location (e.g., `"Country"`).

- **SpatialDimValueCode**: A unique code for the specific location (e.g., `"AGO"` for Angola).

- **Location**: The name of the specific location (e.g., `"Angola"`).

- **Period type**: Specifies the type of time period (e.g., `"Year"`).

- **Period**: The year for which the data is recorded, stored as an integer (e.g., `2023`).

- **Value**: The estimated number of people living with HIV, often including a range (e.g., `"320 000 [280 000 - 380 000]"` for Angola in 2023).


This dataset gives the estimated number of people living with HIV across countries and regions, for years between 2000 and 2023. The `Value` field is the most important column: it holds the central estimate as text, usually followed by a lower and upper bound in square brackets. The structure supports analysis by region, country, and year.

Because the dataset covers a single indicator (the estimated number of people living with HIV), the columns `IndicatorCode`, `Indicator`, and `ValueType` are redundant and add nothing to trend analysis, so we drop them.
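Before dropping columns, it can help to confirm that they really hold a single value each. The following is a minimal sketch on a toy frame; `sample` and the helper `constant_columns` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for hiv_df; the real frame is loaded from the CSV above
sample = pd.DataFrame({
    'IndicatorCode': ['HIV_0000000001'] * 3,
    'ValueType': ['numeric'] * 3,
    'Location': ['Angola', 'Angola', 'Viet Nam'],
    'Period': [2023, 2022, 2020],
})

def constant_columns(df):
    """Return the names of columns that hold exactly one distinct value."""
    counts = df.nunique(dropna=False)
    return counts[counts == 1].index.tolist()

redundant = constant_columns(sample)
print(redundant)

# Drop only the columns that were confirmed to be constant
trimmed = sample.drop(columns=redundant)
```

Run against the real `hiv_df`, the same check would show whether other columns are constant as well.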

In [113]:
# drop redundant columns that do not add analytic value
hiv_df = hiv_df.drop(columns=['IndicatorCode', 'Indicator', 'ValueType'])

In [114]:
hiv_df

| | ParentLocationCode | ParentLocation | Location type | SpatialDimValueCode | Location | Period type | Period | Value |
|---|---|---|---|---|---|---|---|---|
| 0 | AFR | Africa | Country | AGO | Angola | Year | 2023 | 320 000 [280 000 - 380 000] |
| 1 | AFR | Africa | Country | AGO | Angola | Year | 2022 | 320 000 [280 000 - 380 000] |
| 2 | AFR | Africa | Country | AGO | Angola | Year | 2021 | 320 000 [280 000 - 380 000] |
| 3 | AFR | Africa | Country | AGO | Angola | Year | 2020 | 320 000 [280 000 - 370 000] |
| 4 | AFR | Africa | Country | AGO | Angola | Year | 2015 | 300 000 [260 000 - 350 000] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1547 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2020 | No data |
| 1548 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2015 | No data |
| 1549 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2010 | No data |
| 1550 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2005 | No data |


In [115]:
hiv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1552 entries, 0 to 1551
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ParentLocationCode   1552 non-null   object
 1   ParentLocation       1552 non-null   object
 2   Location type        1552 non-null   object
 3   SpatialDimValueCode  1552 non-null   object
 4   Location             1552 non-null   object
 5   Period type          1552 non-null   object
 6   Period               1552 non-null   int64 
 7   Value                1552 non-null   object
dtypes: int64(1), object(7)
memory usage: 97.1+ KB


### Cleaning

In [116]:
# duplicates
hiv_df.duplicated().sum()

0

In [117]:
# nulls
hiv_df.isna().sum()

ParentLocationCode     0
ParentLocation         0
Location type          0
SpatialDimValueCode    0
Location               0
Period type            0
Period                 0
Value                  0
dtype: int64

There are no duplicates and no null values. However, some entries in the `Value` field contain "No data" or placeholders such as "<200" for very small estimates, so the column needs cleaning until each entry holds a single number.
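To see how many entries fall into each of these cases before cleaning, the raw strings can be classified and counted. This is a small sketch on made-up values; the exact shape of the "<200" entries (here `'<200 [<100 - <500]'`) and the helper name `classify` are assumptions for illustration.

```python
import pandas as pd

# Toy values in the shapes described for the Value column
values = pd.Series([
    '320 000 [280 000 - 380 000]',
    'No data',
    '<200 [<100 - <500]',   # assumed format of a small-estimate placeholder
    'No data',
    '1500 [1200 - 1900]',
])

def classify(val):
    """Label a raw entry as 'missing', 'below_threshold' or 'estimate'."""
    if val == 'No data':
        return 'missing'
    if val.startswith('<'):
        return 'below_threshold'
    return 'estimate'

# Count how many entries fall into each category
summary = values.map(classify).value_counts()
print(summary)
```

On the real data this would be applied to `hiv_df['Value']` instead of the toy `values` series.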

In [118]:
# entries with the value "No data"
no_data = hiv_df[hiv_df['Value'] == 'No data']
no_data

| | ParentLocationCode | ParentLocation | Location type | SpatialDimValueCode | Location | Period type | Period | Value |
|---|---|---|---|---|---|---|---|---|
| 40 | AFR | Africa | Country | CAF | Central African Republic | Year | 2023 | No data |
| 41 | AFR | Africa | Country | CAF | Central African Republic | Year | 2022 | No data |
| 42 | AFR | Africa | Country | CAF | Central African Republic | Year | 2021 | No data |
| 43 | AFR | Africa | Country | CAF | Central African Republic | Year | 2020 | No data |
| 44 | AFR | Africa | Country | CAF | Central African Republic | Year | 2015 | No data |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1547 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2020 | No data |
| 1548 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2015 | No data |
| 1549 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2010 | No data |
| 1550 | WPR | Western Pacific | Country | WSM | Samoa | Year | 2005 | No data |


In [119]:
# countries with No data on people living with HIV
no_data['Location'].unique()

array(['Central African Republic', 'Cameroon', 'Equatorial Guinea',
       'Sao Tome and Principe', 'Seychelles', 'Antigua and Barbuda',
       'Canada', 'Dominica', 'Grenada', 'Saint Kitts and Nevis',
       'Saint Lucia', 'Trinidad and Tobago', 'United States of America',
       'Saint Vincent and the Grenadines', 'Bahrain', 'Andorra',
       'Austria', 'Belgium', 'Cyprus', 'Germany', 'Finland',
       'United Kingdom of Great Britain and Northern Ireland', 'Hungary',
       'Monaco', 'Netherlands (Kingdom of the)', 'Norway', 'Poland',
       'Russian Federation', 'San Marino', 'Sweden', 'Turkmenistan',
       'T\x9frkiye', 'Ukraine', 'Uzbekistan', 'India', 'Maldives',
       "Democratic People's Republic of Korea", 'Brunei Darussalam',
       'China', 'Cook Islands', 'Micronesia (Federated States of)',
       'Japan', 'Kiribati', 'Republic of Korea', 'Marshall Islands',
       'Niue', 'Nauru', 'Palau', 'Solomon Islands', 'Tonga', 'Tuvalu',
       'Vanuatu', 'Samoa'], dtype=object)
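The entry `'T\x9frkiye'` in the output above suggests the file is not really ISO-8859-1, since byte `0x9f` is a control character in that encoding. One possibility, stated here as an assumption rather than a confirmed fact about the source file, is that it was saved as Mac OS Roman, where `0x9f` is `ü`. A minimal sketch of repairing such a string after loading (`repair_mac_roman` is a hypothetical helper):

```python
def repair_mac_roman(text):
    """Re-decode text that was read as ISO-8859-1 but was really Mac OS Roman.

    Assumes every character came from a single byte, which holds for
    anything decoded with encoding='ISO-8859-1'.
    """
    return text.encode('latin-1').decode('mac_roman')

fixed = repair_mac_roman('T\x9frkiye')
print(fixed)

# Plain ASCII names pass through unchanged
unchanged = repair_mac_roman('Angola')
```

If the assumption holds, passing `encoding='mac_roman'` to `pd.read_csv` would avoid the problem at load time; this should be checked against other non-ASCII names in the file before relying on it.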

In [120]:
# cleaning our dataset entries with the value "No data"
hiv_df = hiv_df[hiv_df['Value'] != 'No data']
hiv_df

| | ParentLocationCode | ParentLocation | Location type | SpatialDimValueCode | Location | Period type | Period | Value |
|---|---|---|---|---|---|---|---|---|
| 0 | AFR | Africa | Country | AGO | Angola | Year | 2023 | 320 000 [280 000 - 380 000] |
| 1 | AFR | Africa | Country | AGO | Angola | Year | 2022 | 320 000 [280 000 - 380 000] |
| 2 | AFR | Africa | Country | AGO | Angola | Year | 2021 | 320 000 [280 000 - 380 000] |
| 3 | AFR | Africa | Country | AGO | Angola | Year | 2020 | 320 000 [280 000 - 370 000] |
| 4 | AFR | Africa | Country | AGO | Angola | Year | 2015 | 300 000 [260 000 - 350 000] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1531 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2020 | 250 000 [230 000 - 270 000] |
| 1532 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2015 | 240 000 [210 000 - 260 000] |
| 1533 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2010 | 210 000 [190 000 - 230 000] |
| 1534 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2005 | 180 000 [150 000 - 200 000] |


We have dropped 394 rows whose `Value` was "No data" (1552 rows before, 1158 after).
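An alternative to filtering on the literal string is to let pandas treat "No data" as missing while reading the file, then drop the missing rows and count them. The sketch below uses a small in-memory CSV in place of the real file; the names `toy_df` and `toy_clean` are illustrative.

```python
import io
import pandas as pd

# Toy CSV standing in for 'HIV data 2000-2023.csv'
raw = io.StringIO(
    "Location,Period,Value\n"
    "Angola,2023,320 000 [280 000 - 380 000]\n"
    "Samoa,2020,No data\n"
    "Viet Nam,2020,250 000 [230 000 - 270 000]\n"
    "Samoa,2015,No data\n"
)

# Treat the literal string "No data" as a missing value while reading
toy_df = pd.read_csv(raw, na_values=['No data'])
rows_before = len(toy_df)

# Drop rows without an estimate and count how many were removed
toy_clean = toy_df.dropna(subset=['Value'])
rows_dropped = rows_before - len(toy_clean)
print(rows_dropped)
```

With this approach `isna().sum()` reports the missing estimates directly, instead of showing zero nulls while the column still contains "No data" strings.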

In [121]:
import re

# extract the central estimate (320000) as a number for values with the format: 320 000 [280 000 - 380 000]
def extract_value(val):
    if not isinstance(val, str):
        return None  # fallback for non-string input
    # Optional leading "<", then digits that may be grouped with spaces
    match = re.match(r'^(<?)\s*(\d[\d\s]*)', val.strip())
    if match is None:
        return None  # fallback if the text does not start with a number
    number = int(re.sub(r'\s+', '', match.group(2)))  # remove spaces between digit groups
    # Values like "<500": assume just under that number
    return number - 1 if match.group(1) else number

# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
hiv_df = hiv_df.copy()

# Apply to the Value column and convert to int
hiv_df['Value'] = hiv_df['Value'].apply(extract_value).astype(int)

hiv_df


| | ParentLocationCode | ParentLocation | Location type | SpatialDimValueCode | Location | Period type | Period | Value |
|---|---|---|---|---|---|---|---|---|
| 0 | AFR | Africa | Country | AGO | Angola | Year | 2023 | 320000 |
| 1 | AFR | Africa | Country | AGO | Angola | Year | 2022 | 320000 |
| 2 | AFR | Africa | Country | AGO | Angola | Year | 2021 | 320000 |
| 3 | AFR | Africa | Country | AGO | Angola | Year | 2020 | 320000 |
| 4 | AFR | Africa | Country | AGO | Angola | Year | 2015 | 300000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1531 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2020 | 250000 |
| 1532 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2015 | 240000 |
| 1533 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2010 | 210000 |
| 1534 | WPR | Western Pacific | Country | VNM | Viet Nam | Year | 2005 | 180000 |
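Keeping only the central estimate discards the bracketed range. If the bounds are wanted later, they can be parsed from the original strings before the column is overwritten. The following is a sketch, not the notebook's method: `parse_estimate` and `to_int` are hypothetical helpers, the `central [lower - upper]` layout is assumed from the sample rows, and a leading `<` is simply stripped here rather than converted to "one less".

```python
import re

# Assumed layout: "central [lower - upper]", digits optionally grouped by spaces
PATTERN = re.compile(
    r'^\s*(?P<central><?[\d\s]+?)\s*\['
    r'\s*(?P<lower><?[\d\s]+?)\s*-'
    r'\s*(?P<upper><?[\d\s]+?)\s*\]\s*$'
)

def to_int(token):
    """Convert '280 000' or '<500' to an int, ignoring spaces and the '<' sign."""
    return int(re.sub(r'[<\s]', '', token))

def parse_estimate(val):
    """Return (central, lower, upper), or None if the text does not match."""
    match = PATTERN.match(val)
    if match is None:
        return None
    return tuple(to_int(match.group(name)) for name in ('central', 'lower', 'upper'))

angola_2023 = parse_estimate('320 000 [280 000 - 380 000]')
print(angola_2023)
```

Entries such as "No data" do not match the pattern and come back as `None`, so they can be filtered out in the same pass.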


In [122]:
hiv_df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 1158 entries, 0 to 1535
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ParentLocationCode   1158 non-null   object
 1   ParentLocation       1158 non-null   object
 2   Location type        1158 non-null   object
 3   SpatialDimValueCode  1158 non-null   object
 4   Location             1158 non-null   object
 5   Period type          1158 non-null   object
 6   Period               1158 non-null   int64 
 7   Value                1158 non-null   int32 
dtypes: int32(1), int64(1), object(6)
memory usage: 76.9+ KB
