# Advanced Classification Task

In this exercise, we will develop a neural-network-based classification method to discriminate between the WZ and ZZ processes.
WZ and ZZ are Standard Model processes that can be produced in proton-proton collisions at the LHC at CERN.

- WZ: a W boson is produced together with a Z boson
- ZZ: two Z bosons are produced

Because these bosons have very short lifetimes, they decay into lighter particles such as leptons and quarks (jets).

For today, we consider the following decay channels:
- WZ $\rightarrow lll\nu$
- ZZ $\rightarrow llll$

where $l$ stands for a light lepton, i.e. an electron ($e$) or a muon ($\mu$).

We look at properties of the decay products and try to distinguish WZ from ZZ. You have been given input files containing some of these properties, to be used as input variables.

Follow today's lecture slides to learn about the variables and the setup in detail.

## Structure of the notebook

- The problem is broken down into multiple short tasks with related code blocks.
- We provide instructions for each task.
- We expect you to complete the code blocks and obtain the result.

**Let's start!**

## Task 1

- Import all the necessary libraries (numpy, pandas, etc.)
- Don't forget train_test_split, roc_curve, and auc

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

# Choose an output file name for the PDF that will collect your plots, e.g. output.pdf
outputname = ''
from matplotlib.backends.backend_pdf import PdfPages
pp = PdfPages(outputname)


## Complete this block


## Task 2

- Read in a selection of variables from the input files
- The input files are **input_WZ.txt** and **input_ZZ.txt**
- Look at the companion file ```hep_classify_plot_variables.py``` for a list of variables and their names
- Make one dataframe for WZ and one for ZZ
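
As a minimal sketch of reading a plain-text table into a dataframe, the example below parses a small in-memory sample. It assumes whitespace-separated numeric columns with no header row, which may not match the real files. The column names `pt_lep1`, `pt_lep2`, `met` and the sample rows are hypothetical stand-ins, not the contents of `input_WZ.txt`; take the actual names from the companion file.

```python
import io
import pandas as pd

# Hypothetical stand-in for an input file: three whitespace-separated columns, no header.
sample_text = """\
45.2 30.1 12.5
60.7 25.3 40.2
38.9 22.0 8.1
"""

# Assumed column names, for illustration only.
col_names = ["pt_lep1", "pt_lep2", "met"]
# Subset of columns to keep, selected by name.
cols = ["pt_lep1", "met"]

# sep=r"\s+" splits on any run of whitespace; names labels the columns,
# usecols keeps only the requested subset.
df = pd.read_csv(io.StringIO(sample_text), sep=r"\s+", header=None,
                 names=col_names, usecols=cols)
print(df.head())
```

For a file on disk, the `io.StringIO(...)` argument would be replaced by the file path. If the real files use a different separator, `sep` has to change accordingly.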

In [None]:
col_names=
cols=

# Read in the two dataframes, one for WZ and one for ZZ

WZBk =
ZZBk =



## Task 3

- Assign target labels for WZ and ZZ: one is 0, the other is 1
- This is done by adding one additional column, holding that value, to each dataframe
- Merge the two dataframes into one for training
- Split off the label column as y and the input variables as X
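
The steps above can be sketched on toy dataframes. The variable `met` and its values are hypothetical. The choice WZ = 0, ZZ = 1 is a convention picked for this sketch; it happens to agree with the plotting code later in this notebook, which labels events with truth value 1 as ZZ.

```python
import pandas as pd

# Toy dataframes with one hypothetical input variable each.
wz = pd.DataFrame({"met": [40.2, 35.1, 52.8]})
zz = pd.DataFrame({"met": [8.1, 12.5]})

# Convention used in this sketch: WZ -> 0, ZZ -> 1.
wz["label"] = 0
zz["label"] = 1

# ignore_index=True gives the merged frame a fresh 0..N-1 index.
merged = pd.concat([wz, zz], ignore_index=True)

# Inputs are every column except the label; the target is the label column.
X_toy = merged.drop(columns=["label"])
y_toy = merged["label"]
print(X_toy.shape, y_toy.shape)
```

Dropping the label column from `X_toy` matters: if the label stays among the inputs, the network can read the answer directly and the reported accuracy becomes meaningless.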

In [None]:
WZBk['label']=
ZZBk['label']=

# Merge the two dataframes into one for training
data = pd.concat([WZBk,ZZBk])


# Split the label column as y, and the input variables as X
X =
y =
print(f'Shapes of data, X, y are {data.shape}, {X.shape} , {y.shape}')



## Task 4

- Normalize the input variables so that each one ranges from -1.0 to 1.0
- Split the data into a training set and a testing set
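
The normalization maps each variable $x$ to $x' = 2\,(x - x_{\min})/(x_{\max} - x_{\min}) - 1$, so the column minimum lands on -1 and the maximum on +1. Here is a minimal sketch on toy data, followed by a split with scikit-learn's `train_test_split`. The column names, values, `test_size`, and `random_state` are illustrative assumptions, not values prescribed by the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy inputs: two hypothetical variables with very different ranges.
X_toy = pd.DataFrame({"a": [0.0, 5.0, 10.0, 2.5],
                      "b": [100.0, 300.0, 200.0, 150.0]})
y_toy = pd.Series([0, 1, 1, 0])

# Column-wise min-max scaling: the minimum maps to -1 and the maximum to +1.
x_min = X_toy.min(axis=0)
x_max = X_toy.max(axis=0)
X_scaled = 2.0 * (X_toy - x_min) / (x_max - x_min) - 1.0

# Hold out part of the data for testing; the values below are illustrative.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y_toy, test_size=0.5, random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Note that a column whose values are all identical has $x_{\max} = x_{\min}$, so the scaling divides by zero and produces NaN; such a column carries no information and is best dropped beforehand.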

In [None]:
maxValues = X.max(axis=0)
minValues = X.min(axis=0)
MaxMinusMin = maxValues - minValues
# Rescale each column linearly so that its minimum maps to -1 and its maximum to +1
normedX = 2*((X - minValues)/MaxMinusMin) - 1.0
X = normedX

# print the information
print("Max values")
print(maxValues)
print("Min values")
print(minValues)


# Now we split the data into a training and a testing set
X_train, X_test, y_train, y_test =
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
n_features = X_train.shape[1]
print(f'The number of input variables is {n_features}')

## Task 5

- Declare your model
- Compile the model
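
To make concrete what a stack of dense layers ending in a single sigmoid unit computes, here is a NumPy-only sketch of the forward pass, together with the binary cross-entropy loss that is commonly paired with a sigmoid output. This is not the Keras API and not a solution to the task: the layer sizes, the random weights, the ReLU hidden activations, and the toy labels are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """Affine map of a dense layer: x @ w + b."""
    return x @ w + b

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 4 input variables, hidden layers of 8 and 4 units, 1 output.
n_in, n_h1, n_h2 = 4, 8, 4
w1, b1 = rng.normal(size=(n_in, n_h1)), np.zeros(n_h1)
w2, b2 = rng.normal(size=(n_h1, n_h2)), np.zeros(n_h2)
w3, b3 = rng.normal(size=(n_h2, 1)), np.zeros(1)

# A batch of 5 toy events with inputs already scaled to [-1, 1].
x = rng.uniform(-1.0, 1.0, size=(5, n_in))
labels = np.array([0, 1, 1, 0, 1])

# Forward pass: two hidden layers, then one sigmoid unit giving a score per event.
h1 = relu(dense(x, w1, b1))
h2 = relu(dense(h1, w2, b2))
score = sigmoid(dense(h2, w3, b3)).ravel()

# Binary cross-entropy averaged over the batch; clipping avoids log(0).
p = np.clip(score, 1e-7, 1.0 - 1e-7)
bce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1.0 - p))
print(score.shape, bce)
```

Training adjusts the weights and biases to lower this loss. In the notebook, the model object takes care of that once it has been compiled with a loss and an optimizer.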


In [None]:
model = Sequential()
model.add( )
model.add( )
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(  )

## Task 6
- Train the model
- Print model summary and save the model

In [None]:
# Train the model using model.fit
history =

# Print model summary and save the model
model.summary()
model.save( ... )

<br>

<mark>Phew! Well done, folks! It's time for plotting!</mark>

<br>

## Task 7

Now make various plots:
- First, plot the accuracy using the history object
- Plot both accuracy and val_accuracy
- Then plot the loss, using both loss and val_loss
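
A minimal plotting sketch follows. It assumes the training history is available as a dictionary of per-epoch lists keyed by `accuracy`, `val_accuracy`, `loss`, and `val_loss`, which is the shape Keras exposes as `history.history` when accuracy is tracked as a metric and validation data is supplied during training. The numbers in `hist` are placeholders, not real training results.

```python
import matplotlib.pyplot as plt

# Placeholder values shaped like a training-history dictionary; not real results.
hist = {
    "accuracy": [0.60, 0.70, 0.75],
    "val_accuracy": [0.58, 0.68, 0.72],
    "loss": [0.68, 0.58, 0.52],
    "val_loss": [0.69, 0.60, 0.55],
}
epochs = range(1, len(hist["accuracy"]) + 1)

fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: training and validation accuracy per epoch.
ax_acc.plot(epochs, hist["accuracy"], label="accuracy")
ax_acc.plot(epochs, hist["val_accuracy"], label="val_accuracy")
ax_acc.set_xlabel("Epoch")
ax_acc.set_ylabel("Accuracy")
ax_acc.legend()

# Right panel: training and validation loss per epoch.
ax_loss.plot(epochs, hist["loss"], label="loss")
ax_loss.plot(epochs, hist["val_loss"], label="val_loss")
ax_loss.set_xlabel("Epoch")
ax_loss.set_ylabel("Loss")
ax_loss.legend()

fig.tight_layout()
```

A validation loss that starts rising while the training loss keeps falling is the usual sign of overtraining, which is why both curves are drawn on the same axes.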


In [1]:
# plot accuracy

# plot both accuracy and val accuracy

# plot both loss and val loss

## Task 8

Predict the NN score

- Plot the NN score
- Plot the ROC curve

In [None]:
# Set up some new dataframes: t_df holds the training set, v_df holds the testing (validation) set
t_df = pd.DataFrame()
v_df = pd.DataFrame()
t_df['train_truth'] = y_train
t_df['train_prob'] = 0
v_df['test_truth'] = y_test
v_df['test_prob'] = 0

# Now we evaluate the model on the test and train data by calling the
# predict function

val_pred_proba = model.predict(-------)
train_pred_proba = model.predict(------)
t_df['train_prob'] = train_pred_proba
v_df['test_prob'] = val_pred_proba
    
    
    

Plotting the NN score
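
The test-sample points in the plot carry error bars equal to the square root of the bin content, the usual Poisson estimate for a counted number of events, and are drawn at the bin centres. A small sketch of that bookkeeping, using placeholder scores rather than real network output:

```python
import numpy as np

# Placeholder scores, for illustration only.
scores = np.array([0.02, 0.07, 0.12, 0.51, 0.52, 0.56, 0.93, 0.97])

# 21 edges from 0 to 1 define 20 bins of width 0.05.
edges = np.linspace(0.0, 1.0, 21)

# Count the entries per bin without drawing anything.
counts, edges = np.histogram(scores, bins=edges)

# Poisson uncertainty on each bin count.
errors = np.sqrt(counts)

# Bin centres: midpoint between consecutive edges.
centers = 0.5 * (edges[:-1] + edges[1:])
print(counts.sum(), centers[:3])
```

Computing the centres from the edges gives the same positions as subtracting half a bin width from the upper edges, but keeps working if the binning is changed.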

In [None]:
mybins = np.arange(0,1.05,0.05)

# First we histogram the testing data so it can be plotted as points with errors.
# np.histogram returns (counts, bin_edges) without drawing a stray plot.
testsig = np.histogram(v_df[v_df['test_truth']==1]['test_prob'],bins=mybins)
testsige = np.sqrt(testsig[0])  # Poisson uncertainty on each bin count
testbkg = np.histogram(v_df[v_df['test_truth']==0]['test_prob'],bins=mybins)
testbkge = np.sqrt(testbkg[0])


plt.figure(figsize=(8,8))
plt.errorbar(testsig[1][1:]-0.025, testsig[0], yerr=testsige, fmt='.', color="xkcd:green",label="Test ZZ", markersize=10)
plt.errorbar(testbkg[1][1:]-0.025, testbkg[0], yerr=testbkge, fmt='.', color="xkcd:denim",label="Test WZ", markersize=10)
plt.hist(t_df[t_df['train_truth']==1]['train_prob'],bins=mybins, histtype='step', label="Train ZZ", linewidth=3, color='xkcd:greenish',density=False,log=False)
plt.hist(t_df[t_df['train_truth']==0]['train_prob'],bins=mybins, histtype='step', label="Train WZ", linewidth=3, color='xkcd:sky blue',density=False,log=False)
plt.legend(loc='upper center')
plt.xlabel('Score',fontsize=20)
plt.ylabel('Events',fontsize=20)
plt.title('NN Output',fontsize=20)
plt.xticks([0.0,0.2,0.4,0.6,0.8,1.0],fontsize=12)
plt.yticks(fontsize=12)
#plt.savefig('NNscore.png')
plt.savefig(pp,format='pdf')

## Now add code to plot the ROC curve
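
As a hedged starting point, the sketch below computes and draws a ROC curve with scikit-learn's `roc_curve` and `auc` on four toy events. The `truth` and `scores` arrays are placeholders; in the notebook they would be replaced by the true labels and the predicted scores of the test (or training) sample.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Toy truth labels and scores, for illustration only.
truth = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve scans the score thresholds and returns the false-positive and
# true-positive rates at each one; auc integrates the resulting curve.
fpr, tpr, thresholds = roc_curve(truth, scores)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
```

For these toy values the area under the curve is 0.75: of the four signal-background pairs, three have the signal event scored higher. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.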

In [None]:
# ROC

In [None]:
pp.close()
print('All done')



# Good Job!