## A. Getting Started With Python & Data Analysis
#### Data is the core of data science, so scoping and collecting the right data for a project is crucial to achieving the required results. A complete data science pipeline involves:
#### 1. Data Scoping
#### 2. Data Review
#### 3. Feature engineering 
#### 4. Feature Review 
#### 5. Model Selection and review 
#### 6. Model Evaluation and Insights
#### 7. Interaction Production 
#### 8. Feedback

#### Conducting Exploratory Data Analysis (EDA) on the cleaned data using visualisations and statistical methods gives a quick insight into the various patterns and relationships between features in the dataset. Modelling involves using statistical and machine learning methods for classifying and clustering the processed data to create predictive models. Several evaluation methods are employed to compare the performance of these models and continuously improve before a final model is selected.

#### For the most part, the data science pipeline is not a linear process; it’s instead an iterative process.

#### Data can come in different forms such as CSV, JSON, Excel files and databases. Python is very efficient at processing and wrangling most of these formats. Commonly used libraries include NumPy, Pandas, Matplotlib, Scikit-Learn and TensorFlow.

#### Jupyter notebook is an interactive web environment that supports many programming languages including Python and R, allowing for explanatory text, images and visualisation.

### B. Introduction to NumPy & Creating Arrays.

#### NumPy is a library that has ndarray as its basic data structure used to handle arrays and matrices. A NumPy array has a grid of values all of which are of the same data type, mostly integers and floats. These arrays can also be created from Python lists.

In [2]:
# Import the NumPy library
import numpy as np

arr = [1, 2, 3, 4]  # A simple list assigned to a variable

print(arr)
print(type(arr))

[1, 2, 3, 4]
<class 'list'>


In [3]:
# Convert the list arr to an array
a = np.array(arr)

print(type(a)) # The type of the object, which is a NumPy array

print(a.shape) # The shape of the array, which is (4,)

print(a.dtype) # The type of data in the array, which is an integer type (int32 here; the default integer size depends on the platform)

print(a.ndim) # The number of dimensions, which is 1

<class 'numpy.ndarray'>
(4,)
int32
1


In [4]:
# Let's create a two-dimensional array
b = np.array([[1,2,3,4],[5,6,7,8]])

print(b.shape) # The shape of the array where the first dimension has 2 elements and the second has 4.

print(b.ndim) # The dimension of the array

(2, 4)
2


#### There are also some built-in functions that can be used to initialize NumPy arrays, including empty(), zeros(), ones(), full() and random.random().

In [5]:
zero_array = np.zeros(5) #Takes the number of zeros as an argument 

print(zero_array)

[0. 0. 0. 0. 0.]


In [6]:
empty_array = np.empty([2,2]) # Takes the shape (an integer or a sequence of integers); the values are uninitialized, so the output varies between runs

print(empty_array)

[[-1.10380189e-282  2.52625418e+286]
 [ 6.65259379e-301  2.21764760e-301]]


In [100]:
one_array = np.ones([2,3])

one_array2 = np.ones(5)

print(one_array)
print('-----------')
print(one_array2)

[[1. 1. 1.]
 [1. 1. 1.]]
-----------
[1. 1. 1. 1. 1.]


In [8]:
np.full((2,2),10)

array([[10, 10],
       [10, 10]])

In [9]:
np.full((2,2),[1,2])

array([[1, 2],
       [1, 2]])

In [10]:
np.random.random((2,3))

array([[0.12813459, 0.25392002, 0.31635468],
       [0.49618729, 0.77583433, 0.46724096]])

### C. Intra-operability of Arrays and Scalars.

#### NumPy allows batch arithmetic on arrays by applying the operator element-wise. Scalars are likewise propagated element-wise across an array. Arrays of different shapes cannot be combined element-wise directly; NumPy handles this by broadcasting, which works provided that, comparing the shapes from the last dimension backwards, each pair of dimensions is either equal or one of them is 1.

In [11]:
c = np.array([[1.0,2.0,3.0],[4.0,5.0,6.0]])
d = np.array([[2.0,4.0,8.0],[1.0,3.0,6.0]])

print(c)
print(d)

[[1. 2. 3.]
 [4. 5. 6.]]
[[2. 4. 8.]
 [1. 3. 6.]]


In [12]:
c+d #Addition operator 

array([[ 3.,  6., 11.],
       [ 5.,  8., 12.]])

In [13]:
c-d # subtraction operator

array([[-1., -2., -5.],
       [ 3.,  2.,  0.]])

In [14]:
d/5 #Divide

array([[0.4, 0.8, 1.6],
       [0.2, 0.6, 1.2]])

In [15]:
c**2 #Power

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])
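
#### The broadcasting rule described above can be sketched as follows. This is a minimal illustration; the arrays `matrix`, `row` and `column` are made-up names and values, not part of the data used elsewhere in this notebook.

```python
import numpy as np

# A (2, 3) array and a (3,) array: the trailing dimensions match (3 == 3),
# so the 1-D array is stretched across both rows.
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row = np.array([10.0, 20.0, 30.0])
print(matrix + row)

# A (2, 1) column broadcasts across the three columns because its
# second dimension is 1.
column = np.array([[100.0], [200.0]])
print(matrix + column)

# Shapes (2, 3) and (2,) are incompatible: the trailing dimensions 3 and 2
# differ and neither is 1, so NumPy raises a ValueError.
try:
    matrix + np.array([1.0, 2.0])
except ValueError as err:
    print("Cannot broadcast:", err)
```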

### D. Indexing With Arrays & Using Arrays for Data Processing

In [16]:
a[2]

3

In [17]:
b[0,0] # Go to row 0 and return the element at index 0

1

In [18]:
b[1,2] # Go to row 1 and return the element at index 2 (the third element)

7

#### Array Slicing 

In [20]:
d

array([[2., 4., 8.],
       [1., 3., 6.]])

In [21]:
d[1, :2]

array([1., 3.])

In [23]:
e=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[11,12,13,14]])
e

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [11, 12, 13, 14]])

In [24]:
e[:3 , :2] # Take the rows at index 0 to 2 and, from each, the elements at index 0 to 1

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

In [25]:
e.sum() # sum of all elements

128

In [26]:
e[1].sum()

26

In [27]:
e[1].mean() # Mean of the elements in row 1

6.5

In [32]:
e.std()  # Population standard deviation of all elements; the formula is shown below

4.0

![image.png](attachment:image.png)
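
#### As a quick check of the value above, the standard deviation can be computed by hand. This sketch assumes the population formula (divide by N), which is what `std()` uses by default in NumPy; `manual_std` and `sample_std` are illustrative names.

```python
import numpy as np

e = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [11, 12, 13, 14]])

# Population standard deviation: square root of the mean squared
# deviation from the mean.
manual_std = np.sqrt(((e - e.mean()) ** 2).mean())
print(manual_std)  # same value as e.std()

# ddof=1 divides by N - 1 instead of N (the sample standard deviation).
sample_std = e.std(ddof=1)
print(sample_std)
```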

In [29]:
e[1].min() # Returns the minimum value in row 1

5

In [31]:
np.corrcoef(e)  # Pearson correlation between the rows; each row is treated as one variable

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
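
#### Every entry above is 1 because each row of `e` is another row plus a constant, so the rows are perfectly linearly related. A small sketch with made-up variables (`x`, `rising`, `falling`) shows how the coefficient changes sign:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
rising = 2 * x + 1   # increases linearly with x
falling = -x         # decreases linearly with x

# Each row passed to corrcoef is treated as one variable, so the result
# is a 3 x 3 matrix of pairwise Pearson correlations.
corr = np.corrcoef([x, rising, falling])
print(corr)
```

#### The entry for `x` and `rising` is close to 1, while the entry for `x` and `falling` is close to -1.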

### E. File input and output with Arrays 

#### NumPy arrays can be loaded from and saved to binary files with the .npy extension using load() and save() respectively. The same can be done with text files using loadtxt() and savetxt().

In [33]:
np.save('testfile', np.array([[1, 2, 3], [4, 5, 6]]))

In [34]:
np.load('testfile.npy')

array([[1, 2, 3],
       [4, 5, 6]])

In [37]:
from io import StringIO

newstring = StringIO("0 1\n2 3")

np.loadtxt(newstring)

array([[0., 1.],
       [2., 3.]])
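
#### A text-file round trip with savetxt() and loadtxt() might look like the sketch below. The file name `example.csv` and the comma delimiter are assumptions chosen for illustration; a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.5], [3.5, 4.5]])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name
    np.savetxt(path, data, delimiter=",")
    restored = np.loadtxt(path, delimiter=",")

print(restored)
```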

## Pandas - So Much More Than A Cute Animal

#### Pandas is a library built on NumPy that is used for data manipulation and supports indexing with labels other than integers. Series, DataFrame and Index are the basic data structures in this library. A Series can be thought of as a one-dimensional array whose elements share a single data type, somewhat similar to a NumPy array; however, it can be indexed with descriptive labels as well as integers.

In [39]:
import pandas as pd

days =pd.Series(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
days

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
dtype: object

In [41]:
# You can convert a NumPy array to a Series
new_days = np.array(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])

new_df_days = pd.Series(new_days)

new_df_days

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
dtype: object

In [43]:
# Using strings as index labels

days = pd.Series(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'], index=['a','b','c','d','e','f','g'])
days

a       Monday
b      Tuesday
c    Wednesday
d     Thursday
e       Friday
f     Saturday
g       Sunday
dtype: object

In [44]:
# Passing a dictionary: the keys become the index labels

days = pd.Series({'a':'Monday','b':'Tuesday','c':'Wednesday','d':'Thursday','e':'Friday','f':'Saturday','g':'Sunday'})
days

a       Monday
b      Tuesday
c    Wednesday
d     Thursday
e       Friday
f     Saturday
g       Sunday
dtype: object
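
#### Once a Series has labels, values can be looked up by label or by position. A minimal sketch, using a shortened version of the `days` Series:

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday'], index=['a', 'b', 'c'])

print(days['b'])          # look up by label
print(days.iloc[0])       # look up by integer position
print(days.loc['a':'b'])  # label slices include both endpoints
```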

### Data Frames

In [46]:
print(pd.DataFrame())

Empty DataFrame
Columns: []
Index: []


In [47]:
#Creating out of a dictionary
df_dict ={
        'Country':['Nigeria','Kenya','Togo','Ghana'],
         'Capital':['Abuja','Nairobi','Lome','Accra'],
         'Population':[10000,8500,3500,1200],
         'Age':[60,50,70,80]
         }

df = pd.DataFrame(df_dict, index = [2,4,6,8])

df

   Country  Capital  Population  Age
2  Nigeria    Abuja       10000   60
4    Kenya  Nairobi        8500   50
6     Togo     Lome        3500   70
8    Ghana    Accra        1200   80


In [48]:
# Creating out of lists: each inner list below holds one column,
# so zip(*df_list) transposes them into rows before building the frame
df_list = [['Nigeria','Kenya','Togo','Ghana'],
           ['Abuja','Nairobi','Lome','Accra'],
          [10000,8500,3500,1200],
          [60,50,70,80]
          ]
df1 = pd.DataFrame(list(zip(*df_list)), columns = ['Country','Capital','Population','Age'], index = [2,4,6,8])

df1

   Country  Capital  Population  Age
2  Nigeria    Abuja       10000   60
4    Kenya  Nairobi        8500   50
6     Togo     Lome        3500   70
8    Ghana    Accra        1200   80


#### at, iat, iloc and loc are accessors used to retrieve data in dataframes. iloc selects values from the rows and columns by using integer index to locate positions, while loc selects rows or columns using labels. at and iat are used to retrieve single values such that at uses the column and row labels and iat uses indices.

In [49]:
df.iloc[3]

Country       Ghana
Capital       Accra
Population     1200
Age              80
Name: 8, dtype: object

In [51]:
df.loc[6]

Country       Togo
Capital       Lome
Population    3500
Age             70
Name: 6, dtype: object
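
#### The four accessors can be compared side by side. This is a small sketch on a reduced copy of the frame above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Country': ['Nigeria', 'Kenya', 'Togo', 'Ghana'],
     'Capital': ['Abuja', 'Nairobi', 'Lome', 'Accra']},
    index=[2, 4, 6, 8],
)

by_label = df.at[4, 'Capital']           # row label 4, column label 'Capital'
by_position = df.iat[1, 1]               # second row, second column
label_block = df.loc[[2, 6], 'Country']  # rows labelled 2 and 6
position_block = df.iloc[0:2, 0]         # first two rows, first column

print(by_label, by_position)
```

#### Both `by_label` and `by_position` refer to the same cell: the label 4 sits at position 1 in the index.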

In [52]:
df['Capital']

2      Abuja
4    Nairobi
6       Lome
8      Accra
Name: Capital, dtype: object

In [53]:
df['Population'].sum()

23200

In [54]:
df['Country'].count()

4

In [55]:
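df.mean(numeric_only=True) # Average only the numeric columns; recent pandas versions raise an error on the text columns otherwise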

Population    5800.0
Age             65.0
dtype: float64

In [56]:
df.describe()

         Population        Age
count      4.000000   4.000000
mean    5800.000000  65.000000
std     4138.437708  12.909944
min     1200.000000  50.000000
25%     2925.000000  57.500000
50%     6000.000000  65.000000
75%     8875.000000  72.500000
max    10000.000000  80.000000


#### The missing data enigma: Importance, types and handling missing data.

In [57]:
df

   Country  Capital  Population  Age
2  Nigeria    Abuja       10000   60
4    Kenya  Nairobi        8500   50
6     Togo     Lome        3500   70
8    Ghana    Accra        1200   80


In [60]:
df.at[2,'Country']=np.nan # Let's assign a null value to Nigeria in the dataframe

In [61]:
df

  Country  Capital  Population  Age
2     NaN    Abuja       10000   60
4   Kenya  Nairobi        8500   50
6    Togo     Lome        3500   70
8   Ghana    Accra        1200   80


In [62]:
df.isnull()

   Country  Capital  Population    Age
2     True    False       False  False
4    False    False       False  False
6    False    False       False  False
8    False    False       False  False


In [67]:
df.dropna()

  Country  Capital  Population  Age
4   Kenya  Nairobi        8500   50
6    Togo     Lome        3500   70
8   Ghana    Accra        1200   80


#### Pandas represents missing values as NA or NaN, which can be detected, filled, removed or replaced with functions like isnull(), notnull(), fillna(), dropna() and replace().
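
#### Filling, rather than dropping, missing values could be sketched as follows. The frame and the replacement choices ('Unknown' for text, the column mean for numbers) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Country': [np.nan, 'Kenya', 'Togo'],
     'Population': [10000, np.nan, 3500]},
)

# notnull() is the inverse of isnull().
print(df.notnull())

# Fill each column with its own replacement: a placeholder label for text,
# the column mean for numbers.
filled = df.fillna({'Country': 'Unknown',
                    'Population': df['Population'].mean()})
print(filled)
```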

#### Like NumPy, Pandas has functions that provide descriptive statistics such as measures of central tendency, dispersion, skewness and kurtosis, and correlation. Some of these functions are mode(), median(), mean(), sum(), std(), var(), skew(), kurt() and min().
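
#### A few of these functions applied to one column, as a minimal sketch. The values are the Population figures from the small frame above; `summary` is an illustrative name.

```python
import pandas as pd

population = pd.Series([10000, 8500, 3500, 1200])

summary = {
    'mean': population.mean(),
    'median': population.median(),
    'std': population.std(),   # sample standard deviation (ddof=1)
    'var': population.var(),
    'skew': population.skew(),
    'kurt': population.kurt(),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

#### Note that std() in Pandas divides by N - 1 by default, whereas std() in NumPy divides by N, so the two can give different results on the same data.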

## Data Types & Data Wrangling

#### Format for saving and loading a data set

![image.png](attachment:image.png)
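
#### As a minimal sketch of saving and loading, assuming the common to_csv()/read_csv() and to_json()/read_json() pairs, the round trip below writes to in-memory text instead of files. The frame contents are made up for illustration.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'Country': ['Nigeria', 'Kenya'],
                   'Population': [10000, 8500]})

# CSV round trip; index=False keeps the row labels out of the output.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(StringIO(csv_text))

# JSON round trip using the one-record-per-row layout.
json_text = df.to_json(orient='records')
from_json = pd.read_json(StringIO(json_text), orient='records')

print(from_csv)
print(from_json)
```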


In [84]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
fuel_data=pd.read_csv('https://github.com/WalePhenomenon/climate_change/blob/master/fuel_ferc1.csv?raw=true',on_bad_lines='skip')
fuel_data.head()

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
0,f1_fuel_1994_12_1_0_7,1,1994,rockport,coal,ton,5377489.0,16.59,18.59,18.53,1.121
1,f1_fuel_1994_12_1_0_10,1,1994,rockport total plant,coal,ton,10486945.0,16.592,18.58,18.53,1.12
2,f1_fuel_1994_12_2_0_1,2,1994,gorgas,coal,ton,2978683.0,24.13,39.72,38.12,1.65
3,f1_fuel_1994_12_2_0_7,2,1994,barry,coal,ton,3739484.0,23.95,47.21,45.99,1.97
4,f1_fuel_1994_12_2_0_10,2,1994,chickasaw,gas,mcf,40533.0,1.0,2.77,2.77,2.57


In [94]:
fuel_data.describe()

Unnamed: 0,utility_id_ferc1,report_year,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
count,29523.0,29523.0,29523.0,29523.0,29523.0,29523.0,29523.0
mean,118.601836,2005.80605,2622119.0,8.492111,208.649031,917.5704,19.304354
std,74.178353,7.025483,9118004.0,10.60022,2854.49009,68775.93,2091.540939
min,1.0,1994.0,1.0,1e-06,-276.08,-874.937,-41.501
25%,55.0,2000.0,13817.0,1.024,5.207,3.7785,1.94
50%,122.0,2006.0,253322.0,5.762694,26.0,17.371,4.127
75%,176.0,2012.0,1424034.0,17.006,47.113,42.137,7.745
max,514.0,2018.0,555894200.0,341.26,139358.0,7964521.0,359278.0


In [85]:
#Check for missing values 
    
fuel_data.isnull().sum()

record_id                         0
utility_id_ferc1                  0
report_year                       0
plant_name_ferc1                  0
fuel_type_code_pudl               0
fuel_unit                       180
fuel_qty_burned                   0
fuel_mmbtu_per_unit               0
fuel_cost_per_unit_burned         0
fuel_cost_per_unit_delivered      0
fuel_cost_per_mmbtu               0
dtype: int64

In [92]:
fuel_data.groupby('fuel_unit')['fuel_unit'].count() # Count the rows for each fuel unit to see which is the most common

fuel_unit
bbl        7998
gal          84
gramsU      464
kgU         110
mcf       11354
mmbtu       180
mwdth        95
mwhth       100
ton        8958
Name: fuel_unit, dtype: int64

In [93]:
# Fill the missing values with the most common value above

fuel_data['fuel_unit'] = fuel_data['fuel_unit'].fillna('mcf')
# Check for missing values again

fuel_data.isnull().sum()

record_id                       0
utility_id_ferc1                0
report_year                     0
plant_name_ferc1                0
fuel_type_code_pudl             0
fuel_unit                       0
fuel_qty_burned                 0
fuel_mmbtu_per_unit             0
fuel_cost_per_unit_burned       0
fuel_cost_per_unit_delivered    0
fuel_cost_per_mmbtu             0
dtype: int64

In [95]:
fuel_data.groupby('report_year')['report_year'].count()

report_year
1994    1235
1995    1201
1996    1088
1997    1094
1998    1107
1999    1050
2000    1373
2001    1356
2002    1205
2003    1211
2004    1192
2005    1269
2006    1243
2007    1264
2008    1228
2009    1222
2010    1261
2011    1240
2012    1243
2013    1199
2014    1171
2015    1093
2016    1034
2017     993
2018     951
Name: report_year, dtype: int64

In [98]:
fuel_data.groupby('fuel_type_code_pudl').first()

fuel_type_code_pudl,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
coal,f1_fuel_1994_12_1_0_7,1,1994,rockport,ton,5377489.0,16.59,18.59,18.53,1.121
gas,f1_fuel_1994_12_2_0_10,2,1994,chickasaw,mcf,40533.0,1.0,2.77,2.77,2.57
nuclear,f1_fuel_1994_12_2_1_1,2,1994,joseph m. farley,kgU,2260.0,0.064094,28.77,0.0,0.45
oil,f1_fuel_1994_12_6_0_2,6,1994,clinch river,bbl,6510.0,5.875338,32.13,23.444,5.469
other,f1_fuel_1994_12_11_0_6,11,1994,w.f. wyman,bbl,55652.0,0.149719,14.685,15.09,2.335
waste,f1_fuel_1994_12_9_0_3,9,1994,b.l. england,ton,2438.0,0.015939,34.18,34.18,1.072


#### Merging in Pandas can be likened to SQL join operations in relational databases.

In [107]:
#Split the fuel data into two and reset the index

fuel_df1= fuel_data.iloc[:19000].reset_index(drop=True)

fuel_df2= fuel_data.iloc[19000:].reset_index(drop=True)

In [110]:
fuel_df1

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
0,f1_fuel_1994_12_1_0_7,1,1994,rockport,coal,ton,5377489.0,16.590000,18.590,18.530,1.121
1,f1_fuel_1994_12_1_0_10,1,1994,rockport total plant,coal,ton,10486945.0,16.592000,18.580,18.530,1.120
2,f1_fuel_1994_12_2_0_1,2,1994,gorgas,coal,ton,2978683.0,24.130000,39.720,38.120,1.650
3,f1_fuel_1994_12_2_0_7,2,1994,barry,coal,ton,3739484.0,23.950000,47.210,45.990,1.970
4,f1_fuel_1994_12_2_0_10,2,1994,chickasaw,gas,mcf,40533.0,1.000000,2.770,2.770,2.570
...,...,...,...,...,...,...,...,...,...,...,...
18995,f1_fuel_2009_12_182_1_9,182,2009,lake road,gas,mcf,340857.0,1.000000,4.711,4.711,4.711
18996,f1_fuel_2009_12_182_1_10,182,2009,lake road,oil,mcf,771.0,5.801544,84.899,84.899,14.634
18997,f1_fuel_2009_12_182_1_13,182,2009,iatan (18%),coal,ton,414142.0,16.718000,18.509,17.570,1.107
18998,f1_fuel_2009_12_182_1_14,182,2009,iatan (18%),oil,bbl,5761.0,5.537910,83.636,72.280,15.102


In [112]:
# Check the length of the data sets and add them up

In [121]:
assert len(fuel_data) == (len(fuel_df1)+ len(fuel_df2)) , "They are not the same"
print("They match and are equal")

They match and are equal


In [122]:
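# An inner join drops rows that do not match. With no `on` argument, merge matches
# on all shared columns, so the two halves have no rows in common and the result is empty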

pd.merge(fuel_df1,fuel_df2, how='inner')

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu


In [123]:
# An outer join keeps all rows from both frames, whether or not they match

pd.merge(fuel_df1,fuel_df2, how='outer')

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
0,f1_fuel_1994_12_1_0_7,1,1994,rockport,coal,ton,5377489.0,16.590,18.59,18.53,1.121
1,f1_fuel_1994_12_1_0_10,1,1994,rockport total plant,coal,ton,10486945.0,16.592,18.58,18.53,1.120
2,f1_fuel_1994_12_2_0_1,2,1994,gorgas,coal,ton,2978683.0,24.130,39.72,38.12,1.650
3,f1_fuel_1994_12_2_0_7,2,1994,barry,coal,ton,3739484.0,23.950,47.21,45.99,1.970
4,f1_fuel_1994_12_2_0_10,2,1994,chickasaw,gas,mcf,40533.0,1.000,2.77,2.77,2.570
...,...,...,...,...,...,...,...,...,...,...,...
29518,f1_fuel_2018_12_12_0_13,12,2018,neil simpson ct #1,gas,mcf,18799.0,1.059,4.78,4.78,9.030
29519,f1_fuel_2018_12_12_1_1,12,2018,cheyenne prairie 58%,gas,mcf,806730.0,1.050,3.65,3.65,6.950
29520,f1_fuel_2018_12_12_1_10,12,2018,lange ct facility,gas,mcf,104554.0,1.060,4.77,4.77,8.990
29521,f1_fuel_2018_12_12_1_13,12,2018,wygen 3 bhp 52%,coal,ton,315945.0,16.108,3.06,14.76,1.110


In [124]:
# A left join keeps every row of the left frame and drops right-frame rows that do not match it

pd.merge(fuel_df1,fuel_df2, how='left')

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
0,f1_fuel_1994_12_1_0_7,1,1994,rockport,coal,ton,5377489.0,16.590000,18.590,18.530,1.121
1,f1_fuel_1994_12_1_0_10,1,1994,rockport total plant,coal,ton,10486945.0,16.592000,18.580,18.530,1.120
2,f1_fuel_1994_12_2_0_1,2,1994,gorgas,coal,ton,2978683.0,24.130000,39.720,38.120,1.650
3,f1_fuel_1994_12_2_0_7,2,1994,barry,coal,ton,3739484.0,23.950000,47.210,45.990,1.970
4,f1_fuel_1994_12_2_0_10,2,1994,chickasaw,gas,mcf,40533.0,1.000000,2.770,2.770,2.570
...,...,...,...,...,...,...,...,...,...,...,...
18995,f1_fuel_2009_12_182_1_9,182,2009,lake road,gas,mcf,340857.0,1.000000,4.711,4.711,4.711
18996,f1_fuel_2009_12_182_1_10,182,2009,lake road,oil,mcf,771.0,5.801544,84.899,84.899,14.634
18997,f1_fuel_2009_12_182_1_13,182,2009,iatan (18%),coal,ton,414142.0,16.718000,18.509,17.570,1.107
18998,f1_fuel_2009_12_182_1_14,182,2009,iatan (18%),oil,bbl,5761.0,5.537910,83.636,72.280,15.102
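
#### The difference between the join types is easier to see on two small frames that share a key column. In this sketch the frames `plants` and `costs` and the key column `plant` are hypothetical and are not taken from the fuel data set.

```python
import pandas as pd

# Hypothetical tables sharing a 'plant' key column.
plants = pd.DataFrame({'plant': ['rockport', 'gorgas', 'barry'],
                       'fuel': ['coal', 'coal', 'coal']})
costs = pd.DataFrame({'plant': ['gorgas', 'barry', 'chickasaw'],
                      'cost': [39.72, 47.21, 2.77]})

inner = pd.merge(plants, costs, on='plant', how='inner')  # keys found in both
outer = pd.merge(plants, costs, on='plant', how='outer')  # keys found in either
left = pd.merge(plants, costs, on='plant', how='left')    # every key on the left

print(len(inner), len(outer), len(left))
```

#### In the left join, 'rockport' has no match in `costs`, so its `cost` is filled with NaN rather than the row being dropped.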


In [125]:
pd.concat([fuel_df1,fuel_df2]).reset_index(drop=True) # Concatenate the two frames and replace their original indexes with a new one

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name_ferc1,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
0,f1_fuel_1994_12_1_0_7,1,1994,rockport,coal,ton,5377489.0,16.590,18.59,18.53,1.121
1,f1_fuel_1994_12_1_0_10,1,1994,rockport total plant,coal,ton,10486945.0,16.592,18.58,18.53,1.120
2,f1_fuel_1994_12_2_0_1,2,1994,gorgas,coal,ton,2978683.0,24.130,39.72,38.12,1.650
3,f1_fuel_1994_12_2_0_7,2,1994,barry,coal,ton,3739484.0,23.950,47.21,45.99,1.970
4,f1_fuel_1994_12_2_0_10,2,1994,chickasaw,gas,mcf,40533.0,1.000,2.77,2.77,2.570
...,...,...,...,...,...,...,...,...,...,...,...
29518,f1_fuel_2018_12_12_0_13,12,2018,neil simpson ct #1,gas,mcf,18799.0,1.059,4.78,4.78,9.030
29519,f1_fuel_2018_12_12_1_1,12,2018,cheyenne prairie 58%,gas,mcf,806730.0,1.050,3.65,3.65,6.950
29520,f1_fuel_2018_12_12_1_10,12,2018,lange ct facility,gas,mcf,104554.0,1.060,4.77,4.77,8.990
29521,f1_fuel_2018_12_12_1_13,12,2018,wygen 3 bhp 52%,coal,ton,315945.0,16.108,3.06,14.76,1.110


In [127]:
#Check for duplicated rows

fuel_data.duplicated().any()

False
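
#### When duplicated() does report repeated rows, drop_duplicates() removes them. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'plant': ['gorgas', 'barry', 'gorgas'],
                   'fuel': ['coal', 'coal', 'coal']})

# Only the second occurrence of a repeated row is flagged.
print(df.duplicated())

# Keep the first occurrence of each row and renumber the index.
deduplicated = df.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```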