# Understanding the Data

In [2]:
# Load the necessary libraries
import pandas as pd
# Load the pharmaceutical spending dataset
df = pd.read_csv('../dsi_team_22/data/raw/pharma.csv')

## Schema

| name | type | description |
|---|---|---|
| LOCATION | string (object) | Country code |
| TIME | number (int64) | Year, four digits (`%Y`) |
| PC_HEALTHXP | number (float64) | Pharmaceutical spending as a percentage of health spending |
| PC_GDP | number (float64) | Pharmaceutical spending as a percentage of GDP |
| USD_CAP | number (float64) | USD per capita (using economy-wide PPPs) |
| FLAG_CODES | string (object) | Flag codes |
| TOTAL_SPEND | number (float64) | Total spending in USD millions |

## General Data Characteristics
Summary statistics for the Pharmaceutical Drug Spending by Countries dataset.

| question | analysis |
|----------|----------|
| How many countries are in this dataset? | There are 36 countries. |
| How many years are in this dataset? | There are 47 distinct years. |
| What is the year range of this dataset? | The data range from 1970 to 2016. |
| What is the total number of observations in the dataset? | There are 1,036 observations. |

#### Supporting Python Code:

In [28]:
# Find the number of unique countries
num_countries = df['LOCATION'].nunique()

# Find the number of unique years
num_years = df['TIME'].nunique()

# Determine the year range of the dataset
year_range = (df['TIME'].min(), df['TIME'].max())

# Find the total number of observations
total_observations = len(df)

# Display the results
print("Number of Countries:", num_countries)
print("Number of Years:", num_years)
print("Year Range:", year_range)
print("Total Observations:", total_observations)

Number of Countries: 36
Number of Years: 47
Year Range: (1970, 2016)
Total Observations: 1036
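
If each row is one country-year pair (an assumption, not checked above), the counts imply an unbalanced panel: 36 countries × 47 years would give 1,692 rows, but only 1,036 are present. The sketch below shows one way to quantify that coverage; `coverage_summary` is a hypothetical helper and the `toy` frame uses placeholder values.

```python
import pandas as pd


def coverage_summary(frame):
    """Compare observed country-year pairs with a fully balanced panel."""
    years_per_country = frame.groupby("LOCATION")["TIME"].nunique()
    possible = frame["LOCATION"].nunique() * frame["TIME"].nunique()
    observed = len(frame.drop_duplicates(["LOCATION", "TIME"]))
    return {
        "possible_pairs": possible,
        "observed_pairs": observed,
        "min_years": int(years_per_country.min()),
        "max_years": int(years_per_country.max()),
    }


# Placeholder data: country AAA reports three years, BBB only one.
toy = pd.DataFrame({
    "LOCATION": ["AAA", "AAA", "AAA", "BBB"],
    "TIME": [2000, 2001, 2002, 2002],
})

summary = coverage_summary(toy)
print(summary)
```

Running `coverage_summary(df)` on the real data would show how unevenly the years are spread across countries, which matters for every total and average that follows.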


## Pharma Spending Statistics

### Average (mean) across all observations from the 36 countries

| column | average | unit |
|--------|---------|------|
| Total spend (`TOTAL_SPEND`) | 11765.42 | USD millions |
| Spend per capita (`USD_CAP`) | 295.05 | USD per capita (using economy-wide PPPs) |
| Share of health spending (`PC_HEALTHXP`) | 16.41 | % |
| Share of GDP (`PC_GDP`) | 1.17 | % |

#### Supporting Python Code:

In [None]:
# Identify Pharma Data Set Averages
avg_total_spend = df['TOTAL_SPEND'].mean()
avg_usd_cap = df['USD_CAP'].mean()
avg_pc_healthxp = df['PC_HEALTHXP'].mean()
avg_pc_gdp = df['PC_GDP'].mean()

# Display the results
print("Average Total Spend:", avg_total_spend)
print("Average USD CAP:", avg_usd_cap)
print("Average PC Health Spending:", avg_pc_healthxp)
print("Average Percentage GDP:", avg_pc_gdp)
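
The `.mean()` calls above average over every row, so a country that reports more years carries more weight than one that reports few. A rough sketch of the alternative, averaging within each country first and then across countries, is shown below; `pooled_and_country_means` is a hypothetical helper and the `toy` values are placeholders.

```python
import pandas as pd


def pooled_and_country_means(frame, column):
    """Return (mean over all rows, mean of per-country means)."""
    pooled = frame[column].mean()
    by_country = frame.groupby("LOCATION")[column].mean().mean()
    return pooled, by_country


# Placeholder data: AAA has three rows, BBB has one.
toy = pd.DataFrame({
    "LOCATION": ["AAA", "AAA", "AAA", "BBB"],
    "USD_CAP": [100.0, 200.0, 300.0, 600.0],
})

pooled, by_country = pooled_and_country_means(toy, "USD_CAP")
print("Pooled mean:", pooled)                  # 300.0
print("Mean of country means:", by_country)    # 400.0
```

Which figure is appropriate depends on the question being asked; the two only coincide when every country contributes the same number of rows.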

## Country Spending Overview

| question | analysis |
|----------|----------|
| Which country has the highest total pharmaceutical spending (summed over all years)? | United States |
| Which country has the lowest total pharmaceutical spending (summed over all years)? | Iceland |


### What are the 10 highest-spending countries?

| rank | country code | country name | total spend (USD millions) |
|------|--------------|--------------|-------------|
| 1 | USA | United States | 4186292.78 |
| 2 | JPN | Japan | 1602492.93 |
| 3 | DEU | Germany | 1188168.49 |
| 4 | FRA | France | 802298.94 |
| 5 | ITA | Italy | 754377.19 |
| 6 | MEX | Mexico | 451903.24 |
| 7 | CAN | Canada | 445157.79 |
| 8 | ESP | Spain | 436073.35 |
| 9 | KOR | South Korea | 413876.98 |
| 10 | GBR | United Kingdom | 243390.12 |

#### Supporting Python Code:


In [None]:
# Identify the Top 10 highest spending countries
top_10_highest_spending = df.groupby('LOCATION')['TOTAL_SPEND'].sum().sort_values(ascending=False).head(10)

# Display the results
print("Top 10 Highest Spending Countries:")
print(top_10_highest_spending)
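
The grouped result is indexed by country code only, while the table above also lists ranks and country names. One possible way to produce that layout is sketched below. `COUNTRY_NAMES` and `label_totals` are hypothetical names; the mapping is copied from the top-10 table, and the totals passed in are placeholder values.

```python
import pandas as pd

# Code-to-name mapping copied from the top-10 table above.
COUNTRY_NAMES = {
    "USA": "United States", "JPN": "Japan", "DEU": "Germany",
    "FRA": "France", "ITA": "Italy", "MEX": "Mexico",
    "CAN": "Canada", "ESP": "Spain", "KOR": "South Korea",
    "GBR": "United Kingdom",
}


def label_totals(totals, names=COUNTRY_NAMES):
    """Turn a sorted Series of totals into a ranked table with country names."""
    table = totals.rename("total_spend").rename_axis("country_code").reset_index()
    # Fall back to the raw code when a name is not in the mapping.
    mapped = table["country_code"].map(names)
    table["country_name"] = mapped.fillna(table["country_code"])
    table.insert(0, "rank", range(1, len(table) + 1))
    return table[["rank", "country_code", "country_name", "total_spend"]]


# Placeholder totals, already sorted in descending order (not real figures).
toy_totals = pd.Series({"USA": 3.0, "JPN": 2.0, "ZZZ": 1.0})
labeled = label_totals(toy_totals)
print(labeled)
```

In the notebook, `label_totals(top_10_highest_spending)` would give the four columns used in the table.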

### What are the 10 lowest-spending countries?

| rank | country code | country name | total spend (USD millions) |
|------|--------------|--------------|-------------|
| 36 | ISL | Iceland | 3752.89 |
| 35 | LUX | Luxembourg | 4858.94 |
| 34 | EST | Estonia | 5179.24 |
| 33 | LVA | Latvia | 6829.89 |
| 32 | NZL | New Zealand | 11907.15 |
| 31 | SVN | Slovenia | 12694.20 |
| 30 | LTU | Lithuania | 13859.94 |
| 29 | ISR | Israel | 16654.40 |
| 28 | TUR | Turkey | 25138.33 |
| 27 | SVK | Slovakia | 38472.30 |


#### Supporting Python Code:

In [None]:
# Identify the Bottom 10 lowest spending countries
bottom_10_lowest_spending = df.groupby('LOCATION')['TOTAL_SPEND'].sum().sort_values(ascending=True).head(10)

# Display the results
print("Bottom 10 Lowest Spending Countries:")
print(bottom_10_lowest_spending)

### Mid-level spending countries (ranks 11 to 26)

| rank | country code | country name | total spend (USD millions) |
|------|--------------|--------------|-------------|
| 11 | AUS | Australia | 210562.01 |
| 12 | POL | Poland | 155643.45 |
| 13 | NLD | Netherlands | 147102.32 |
| 14 | BEL | Belgium | 126330.37 |
| 15 | GRC | Greece | 113206.42 |
| 16 | PRT | Portugal | 106881.27 |
| 17 | CHE | Switzerland | 103005.12 |
| 18 | SWE | Sweden | 93430.58 |
| 19 | AUT | Austria | 82723.22 |
| 20 | HUN | Hungary | 81580.59 |
| 21 | CZE | Czech Republic | 78489.23 |
| 22 | FIN | Finland | 52228.15 |
| 23 | IRL | Ireland | 46876.19 |
| 24 | RUS | Russia | 44655.62 |
| 25 | NOR | Norway | 44315.79 |
| 26 | DNK | Denmark | 38568.93 |

#### Supporting Python Code:

In [None]:
# Calculate the total spend for each country
total_spend_by_country = df.groupby('LOCATION')['TOTAL_SPEND'].sum()

# Identify the top 10 and bottom 10 countries
top_10_countries = total_spend_by_country.nlargest(10)
bottom_10_countries = total_spend_by_country.nsmallest(10)

# Filter out the top 10 and bottom 10 countries
remaining_countries = total_spend_by_country.drop(top_10_countries.index).drop(bottom_10_countries.index).sort_values(ascending=False)

# Display the results
print("\nCountries Not in Top 10 or Bottom 10:")
print(remaining_countries)
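
The three cells above each recompute the same grouped totals. As a minimal sketch, the top, middle, and bottom groups can instead be sliced from one sorted Series. `spending_tiers` is a hypothetical helper, and the `toy` frame uses placeholder values with `n=1` so the result is easy to follow.

```python
import pandas as pd


def spending_tiers(frame, n=10):
    """Split countries into top-n, middle, and bottom-n by summed TOTAL_SPEND."""
    totals = (
        frame.groupby("LOCATION")["TOTAL_SPEND"]
        .sum()
        .sort_values(ascending=False)
    )
    if len(totals) < 2 * n:
        raise ValueError("need at least 2 * n countries to avoid overlapping tiers")
    cut = len(totals) - n
    return totals.iloc[:n], totals.iloc[n:cut], totals.iloc[cut:]


# Placeholder data: five countries, AAA appears twice.
toy = pd.DataFrame({
    "LOCATION": ["AAA", "AAA", "BBB", "CCC", "DDD", "EEE"],
    "TOTAL_SPEND": [5.0, 5.0, 8.0, 6.0, 4.0, 2.0],
})

top, middle, bottom = spending_tiers(toy, n=1)
print(top, middle, bottom, sep="\n\n")
```

Slicing a single sorted Series keeps the three groups mutually exclusive by construction, and the length check makes the small-dataset case fail with a clear message.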