# Description of datasets

The datasets are the "labels" files.


The first row of each file is the header: it gives the names of the variables in that file.
Each column therefore contains all the observed values of that variable.

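You can start by applying PCA to the dataset "labels.csv" in the folder FPFT-CFF.
Because this is a multivariate dataset whose variables span very different magnitudes (for example, 'T (K)' is of order 10^2 to 10^3 while many species values are far below 1), remember to choose a suitable way to center and scale your data before applying PCA.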
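
As a minimal sketch of the centering and scaling step, the block below standardizes each column to zero mean and unit variance and then fits a PCA. The data are synthetic and the column meanings are assumptions made for illustration; standard scaling is only one possible choice and may not be the most suitable one for this dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data (assumption): 200 samples of
# 3 variables on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(1.0, 0.5, 200),       # velocity-like column
    rng.normal(1200.0, 400.0, 200),  # temperature-like column
    rng.normal(1e-3, 5e-4, 200),     # species-like column
])

# Center each column to zero mean and scale it to unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA on the scaled data and project onto the first two components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

print(scores.shape)
print(pca.explained_variance_ratio_)
```

Without the scaling step, the temperature-like column would dominate the first component simply because its variance is numerically the largest.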

You will see that there are many columns; use only the following ones (names written as they appear in the file header):
'u (m/s)', 'T (K)', 'H2', 'H', 'O', 'O2', 'OH', 'H2O', 'HO2', 'H2O2', 'C', 'CH', 'CH2', 'CH2(S)', 'CH3', 'CH4', 'CO', 'CO2', 'HCO', 'CH2O', 'CH2OH', 'CH3O', 'CH3OH', 'C2H', 'C2H2', 'C2H3', 'C2H4', 'C2H5', 'C2H6', 'HCCO', 'CH2CO', 'HCCOH', 'N', 'NH', 'NH2', 'NH3', 'NNH', 'NO', 'NO2', 'N2O', 'HNO', 'CN', 'HCN', 'H2CN', 'HCNN', 'HCNO', 'HOCN', 'HNCO', 'NCO', 'N2', 'AR', 'C3H7', 'C3H8', 'CH2CHO', 'CH3CHO'

They correspond to the following column indices: 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57

For example, the second column (column number 2) contains all the values of the variable 'u (m/s)'. The first column is not included, so you can ignore it. When you use Python, MATLAB, or another language, remember to load into a matrix only the columns listed here.
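
One way to load only the listed columns in Python is to select them by name, which avoids having to work out whether the indices above are zero-based or one-based. The sketch below uses a tiny in-memory tab-separated table with made-up values in place of the real file, and a shortened column list; both are assumptions for illustration.

```python
import io
import pandas as pd

# Tiny tab-separated stand-in for labels.csv (hypothetical values).
# The header starts with a tab, so the first column has no name.
raw = (
    "\tz (m)\tu (m/s)\tV (1/s)\tT (K)\tH2\n"
    "0\t0.01\t0.09\t0.0\t554.1\t0.0008\n"
    "1\t0.02\t2.33\t0.0\t2088.7\t0.0024\n"
)

wanted = ["u (m/s)", "T (K)", "H2"]  # shortened list for illustration

# usecols keeps only the named columns; the others are not loaded.
df_small = pd.read_csv(io.StringIO(raw), sep="\t", usecols=wanted)

# Plain NumPy matrix, with columns in the order given by `wanted`.
X = df_small[wanted].to_numpy()

print(df_small.columns.tolist())
print(X.shape)
```

With the real file, the full list of 55 names would replace `wanted` and the file path would replace the `io.StringIO` object.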

https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png

# Importing libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [35]:
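dataframe_all = pd.read_csv("labels.csv", sep='\t', engine='python')  # dataframe_all.shape below reports (32273, 62)
df = dataframe_all.loc[dataframe_all['Tin'] == 300]  # keep only the rows with Tin == 300
dataframe_all.head()
df.head()  # only this last expression is displayed by the notebook
#print(dataframe_all['Z'].unique())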

Unnamed: 0,z (m),u (m/s),V (1/s),T (K),H2,H,O,O2,OH,H2O,...,AR,C3H7,C3H8,CH2CHO,CH3CHO,Z,c,HR,sm,Tin
4,0.01324,0.089593,0.0,554.125759,0.000807,3.39812e-11,1.818058e-11,0.180601,3.219009e-09,0.0273331,...,5.923582999999999e-100,9.89865e-12,1.623327e-07,1.007418e-12,5.323347e-07,0.028302,0.2151827,-6974.898,0.0,300.0
7,0.017199,2.333259,2.8231810000000004e-175,2088.736013,0.002426,0.00055501,0.0007346723,0.022068,0.004458154,0.1666206,...,1.059033e-22,9.511895e-38,1.918469e-39,7.600638999999999e-26,9.27945e-28,0.049815,0.9726861,-14353360.0,0.0,300.0
8,0.011173,-0.289506,34.05898,1218.412351,0.000323,5.807935e-05,0.0002910539,0.136388,0.0004912981,0.08678798,...,0.009529525,0.0,0.0,1.1947549999999999e-20,7.362522000000001e-23,0.018165,-1.0,-33511780.0,25.0,300.0
9,0.011984,0.375812,7.906682e-139,300.000002,1.6e-05,8.620382999999999e-24,8.607889e-14,0.188246,5.0519420000000005e-17,2.774682e-10,...,0.0,0.0,1.135926e-15,2.601808e-26,2.091631e-16,0.060219,1.19157e-09,-2.068051e-06,0.0,300.0
17,0.013229,0.373328,-2.8012249999999997e-146,300.144781,0.001089,3.277482e-16,7.070345e-11,0.189828,8.988788e-12,1.810403e-05,...,3.1186030000000004e-23,7.432468e-14,2.329438e-12,4.160074e-18,1.891889e-11,0.055046,7.523991e-05,-1.572227,0.0,300.0


In [30]:
dataframe_all.shape

(32273, 62)
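
The row selection `dataframe_all['Tin'] == 300` used earlier is a boolean mask passed to `.loc`. The sketch below shows the same pattern on a hypothetical miniature table, together with a tolerant variant based on `np.isclose`, which may be safer if the `Tin` values were computed rather than entered exactly. The table and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with a Tin column.
toy = pd.DataFrame({
    "T (K)": [554.1, 2088.7, 310.2, 1218.4],
    "Tin":   [300.0, 300.0, 350.0, 300.0],
})

# Exact comparison, the same pattern as in the notebook cell.
exact = toy.loc[toy["Tin"] == 300]

# Tolerant comparison, robust to small floating-point differences.
tolerant = toy.loc[np.isclose(toy["Tin"], 300.0)]

print(len(exact), len(tolerant))
```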

# Select Data-set

In [4]:
#features used
index = ['u (m/s)', 'T (K)', 'H2', 'H', 'O', 'O2', 'OH',
       'H2O', 'HO2', 'H2O2', 'C', 'CH', 'CH2', 'CH2(S)', 'CH3', 'CH4', 'CO',
       'CO2', 'HCO', 'CH2O', 'CH2OH', 'CH3O', 'CH3OH', 'C2H', 'C2H2', 'C2H3',
       'C2H4', 'C2H5', 'C2H6', 'HCCO', 'CH2CO', 'HCCOH', 'N', 'NH', 'NH2',
       'NH3', 'NNH', 'NO', 'NO2', 'N2O', 'HNO', 'CN', 'HCN', 'H2CN', 'HCNN',
       'HCNO', 'HOCN', 'HNCO', 'NCO', 'N2', 'AR', 'C3H7', 'C3H8', 'CH2CHO',
       'CH3CHO']
# target
target_name = ['Tin']

# Loading the Data-set

In [5]:
Y_pf = pd.read_csv("Y_pf.csv", sep=',', names = index) #shape: (26000, 55)

In [7]:
X_pf = pd.read_csv("X.csv", sep=',', names = ['unknown_1', 'unknown_2', 'Tin']) #shape: (25999, 3) 
X_pf['unknown_2'].nunique()

# What are the column names?

13

# Exploratory Data Analysis (EDA)

## Choosing dataset to work with

In [None]:
# Features
#df = new_labels  #shape:
df = Y_pf         #shape:

# target
#target = labels['Tin']  #shape:
target = X_pf['Tin']     #shape:

## Data description and correlation

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
#sns.pairplot(df, height=1.5);
#plt.show()

In [None]:
col_study = ['u (m/s)', 'T (K)', 'H2', 'H', 'O', 'O2', 'OH', 'H2O']

In [None]:
col_study

In [None]:
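# DataFrame.hist creates its own figure, so no separate plt.figure() call is needed
df[col_study].hist(figsize=(16, 12), bins=50, xlabelsize=8, ylabelsize=8);
plt.suptitle("Distribution of the features")  # title for the whole grid, not only the last subplot
plt.close()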

In [None]:
sns.pairplot(df[['u (m/s)', 'T (K)', 'H2', 'H', 'O', 'O2', 'OH',
       'H2O', 'HO2', 'H2O2', 'C', 'CH', 'CH2', 'CH2(S)', 'CH3', 'CH4', 'CO',
       'CO2', 'HCO', 'CH2O', 'CH2OH', 'CH3O', 'CH3OH', 'C2H', 'C2H2', 'C2H3',
       'CH3CHO']], height=3);  # `size` was renamed to `height` in seaborn 0.9
#plt.close()



# Correlation between features

In [None]:
# Correlation between the data
plt.figure(figsize=(16,10))
sns.heatmap(df[col_study].corr(), annot=True)
plt.title("Correlation between features")
#plt.close()
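
A heatmap gets hard to read as the number of columns grows. As a complementary sketch, the block below lists only the strongly correlated pairs from a correlation matrix. The data are synthetic, the column names `A`, `B`, `C` are placeholders, and the 0.9 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: B is built from A, C is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "A": a,
    "B": 2.0 * a + rng.normal(scale=0.1, size=500),
    "C": rng.normal(size=500),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once, then flatten.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Hypothetical threshold: report pairs with |r| above 0.9.
strong = pairs[pairs.abs() > 0.9]
print(strong)
```

On the notebook's data, `toy` would be replaced by `df[col_study]` or by the full feature table.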

In [None]:
temp = df['T (K)'].values.reshape(-1,1)

In [None]:
H2O = df['H2O'].values

In [None]:
plt.figure(figsize=(6,6));
x = df['T (K)'].values
y = df['H2O'].values
plt.scatter(x, y);
plt.xlabel('T (K)')
plt.ylabel("H2O")
#plt.close()