In [12]:
import pandas as pd 

In [13]:
# Load the dataset and check

rd_budget_df = pd.read_csv('data/rd_data.csv')
print(rd_budget_df)

   department  1976_gdp1790000000000.0  1977_gdp2028000000000.0  \
0         DHS                      NaN                      NaN   
1         DOC             8.190000e+08             8.370000e+08   
2         DOD             3.569600e+10             3.796700e+10   
3         DOE             1.088200e+10             1.374100e+10   
4         DOT             1.142000e+09             1.095000e+09   
5         EPA             9.680000e+08             9.660000e+08   
6         HHS             9.226000e+09             9.507000e+09   
7    Interior             1.152000e+09             1.082000e+09   
8        NASA             1.251300e+10             1.255300e+10   
9         NIH             8.025000e+09             8.214000e+09   
10        NSF             2.372000e+09             2.395000e+09   
11      Other             1.191000e+09             1.280000e+09   
12       USDA             1.837000e+09             1.796000e+09   
13         VA             4.040000e+08             3.740000e+0

The dataset initially has 14 rows and 43 columns: each row is a department, and apart from the department column, each column holds the budget for one year. The column names also carry that year's GDP (for example 1976_gdp1790000000000.0). This wide layout is not convenient to read, so we need to reshape the dataset.

In [14]:

# Reshape the dataset from a wide format to a long format 
rd_budget_long = rd_budget_df.melt(id_vars=["department"], var_name="year_gdp", value_name="budget")
print(rd_budget_long)

    department                  year_gdp        budget
0          DHS   1976_gdp1790000000000.0           NaN
1          DOC   1976_gdp1790000000000.0  8.190000e+08
2          DOD   1976_gdp1790000000000.0  3.569600e+10
3          DOE   1976_gdp1790000000000.0  1.088200e+10
4          DOT   1976_gdp1790000000000.0  1.142000e+09
..         ...                       ...           ...
583        NIH  2017_gdp19177000000000.0  3.305200e+10
584        NSF  2017_gdp19177000000000.0  6.040000e+09
585      Other  2017_gdp19177000000000.0  1.553000e+09
586       USDA  2017_gdp19177000000000.0  2.625000e+09
587         VA  2017_gdp19177000000000.0  1.367000e+09

[588 rows x 3 columns]


This code reshapes the dataset using the pandas melt function, which I found online. The department column remains as it is in the new format. The names of the year columns are combined into a single column called year_gdp, and the values from those columns are put into a new column called budget.
Now the dataset has 588 rows (14 departments × 42 years) and 3 columns. The first column is the department, the second is the year together with its GDP label, and the third is the budget itself. This is still not very clean, so we need to proceed further.
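
To see what melt does on something small enough to check by eye, here is a minimal sketch on a made-up table. The department names, column names and numbers are hypothetical and only imitate the layout of the real file; the melt call itself is the same one used above.

```python
import pandas as pd

# Hypothetical toy table: 2 departments and 2 year columns (values are made up).
wide = pd.DataFrame({
    "department": ["A", "B"],
    "1976_gdp100.0": [10.0, 20.0],
    "1977_gdp200.0": [11.0, 21.0],
})

# Same call as in the notebook: keep "department" as an identifier,
# move the old column names into "year_gdp" and their values into "budget".
long = wide.melt(id_vars=["department"], var_name="year_gdp", value_name="budget")

print(long)
print(long.shape)  # 2 departments x 2 year columns = 4 rows, 3 columns
```

Each department appears once per year column, which is why the real dataset goes from 14 rows to 14 × 42 = 588 rows.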

In [15]:
# Extract the year from the "year_gdp" column and add a new column "year"
rd_budget_long["year"] = rd_budget_long["year_gdp"].str.split("_").str[0]
# Drop the original "year_gdp" column
rd_budget_long = rd_budget_long.drop(columns=["year_gdp"])
# Convert the "year" column to integer type
rd_budget_long["year"] = pd.to_numeric(rd_budget_long["year"], errors="coerce")

print(rd_budget_long)

    department        budget  year
0          DHS           NaN  1976
1          DOC  8.190000e+08  1976
2          DOD  3.569600e+10  1976
3          DOE  1.088200e+10  1976
4          DOT  1.142000e+09  1976
..         ...           ...   ...
583        NIH  3.305200e+10  2017
584        NSF  6.040000e+09  2017
585      Other  1.553000e+09  2017
586       USDA  2.625000e+09  2017
587         VA  1.367000e+09  2017

[588 rows x 3 columns]


Here I extracted the year from the year_gdp column, dropped year_gdp, and made the year column numeric.
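
The year extraction can be followed step by step on two labels taken from the output above. This is only a small sketch of the same split and to_numeric calls; the variable names (labels, parts, year_text, years) are introduced here for illustration.

```python
import pandas as pd

# Two labels in the same "<year>_gdp<value>" pattern as the real column names.
labels = pd.Series(["1976_gdp1790000000000.0", "2017_gdp19177000000000.0"])

# Splitting on "_" gives two pieces per label; .str[0] keeps the year text
# and leaves the "gdp..." piece behind.
parts = labels.str.split("_")
year_text = parts.str[0]

# to_numeric turns the text into numbers. With errors="coerce", a label
# that cannot be parsed becomes NaN instead of raising an error.
years = pd.to_numeric(year_text, errors="coerce")

print(years.tolist())
```

One thing to keep in mind: because of errors="coerce", a badly formed label would silently turn into NaN, so it can be worth checking rd_budget_long["year"].isna().sum() after the conversion.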