### Assignment 3 - Logistic Regression

In [None]:
# This code appears in every demonstration notebook.
# By default, running a cell shows only the output of its last expression.
# This setting makes a cell show the output of every expression.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In this assignment, we will build logistic regression models to detect accounting fraud using financial statement features. <br>
The data was collected by Bao et al. (2020) based on the detected material accounting misstatements disclosed in the SEC’s Accounting and Auditing Enforcement Releases (AAERs). <br>
The dataset covers all publicly listed U.S. firms over the period 1990–2014. The variable name of the fraud label is "misstate" (1 denotes fraud, and 0 denotes non-fraud). <br>
We will use both raw financial data from the financial statements and the financial ratios that are used to evaluate the financial performance of a company for detection.<br>

You may find the description of variables in the Word document.

1. Import the libraries

2. Read in the dataset and display basic information about the dataset.

3. Explore the variable 'misstate' with a graph. What do you observe?

4. Next, we sum the number of fraud cases by year and make a line graph.<br>
First, we use the .groupby() method to compute the sums. We did not cover this in class, so the code below is explained in its comments.
Then you can use the result to create a line graph.

In [None]:
Fraud.groupby('fyear')['misstate'].sum().reset_index()

# groupby('fyear') splits the observations into groups, one per value of 'fyear'.
# sum() then adds up 'misstate' within each group; since 'misstate' is 0/1,
# this is the number of fraud cases in each year.
# reset_index() turns the result into a DataFrame with 'fyear' as a regular column.

Save the output of the code above and make a line graph based on it. What do you observe?
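To see what the groupby step returns before applying it to the full dataset, here is a minimal sketch on a made-up toy DataFrame. The name `toy` and its values are assumptions for illustration only, not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data: six firm-year observations over three years.
toy = pd.DataFrame({
    "fyear":    [1990, 1990, 1991, 1991, 1991, 1992],
    "misstate": [0,    1,    1,    1,    0,    0],
})

# Same pattern as above: one row per year, with the count of fraud cases.
fraud_by_year = toy.groupby("fyear")["misstate"].sum().reset_index()
print(fraud_by_year)
```

Because the result is a regular DataFrame with columns `fyear` and `misstate`, it can be passed to a plotting call, for example `fraud_by_year.plot(x="fyear", y="misstate")` if matplotlib is installed.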

5. The percentage of fraud cases is very small. To improve predictive power, we oversample the fraud cases to about 10% of the sample. Please run the code below. Pay attention to how the datasets are named, and change the names to match yours. <br>
You may notice that after oversampling, the number of fraud cases increased.

In [None]:
# Separate into minority and majority
minority_class = Fraud[Fraud['misstate'] == 1]
majority_class = Fraud[Fraud['misstate'] == 0]

# Count minority and majority samples
minority_count = len(minority_class)
majority_count = len(majority_class)
print("Original class distribution:", Fraud['misstate'].value_counts())

###############################
# Desired ratio = 10% / 90%   #
###############################
# A 10/90 split is a 1:9 ratio (minority : majority).
# With N majority samples, we want M = N/9 minority samples (rounded up).

RATIO = 9  # 1 minority : 9 majority
majority_N = majority_count

# Calculate how many minority samples we need to achieve 10/90 ratio
minority_needed = int(np.ceil(majority_N / RATIO))

# If we already have enough minority samples, no oversampling needed
# Otherwise, sample (with replacement) from the minority to get the required count
if minority_needed <= minority_count:
    oversampled_minority = minority_class
else:
    # Randomly sample with replacement to reach minority_needed
    oversampled_minority = minority_class.sample(n=minority_needed, replace=True, random_state=0)

# Combine the new minority subset with the entire majority
Fraud_oversampled = pd.concat([oversampled_minority, majority_class])

# Shuffle the dataset
Fraud_oversampled = Fraud_oversampled.sample(frac=1, random_state=42).reset_index(drop=True)

# Check new class distribution
print("New class distribution:", Fraud_oversampled['misstate'].value_counts())
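The same logic can be checked on a small example where the arithmetic is easy to follow. The sketch below wraps the steps in a hypothetical helper, `oversample_minority`, and applies it to made-up data with 5 fraud and 95 non-fraud rows. The function name, the toy data, and the use of a single seed for both sampling steps are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def oversample_minority(df, label="misstate", ratio=9, seed=0):
    """Resample label == 1 rows with replacement to about 1 per `ratio` label == 0 rows."""
    minority = df[df[label] == 1]
    majority = df[df[label] == 0]
    needed = int(np.ceil(len(majority) / ratio))
    # Only resample when the minority class is too small.
    if needed > len(minority):
        minority = minority.sample(n=needed, replace=True, random_state=seed)
    combined = pd.concat([minority, majority])
    # Shuffle so the classes are not stored in blocks.
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical toy data: 5 fraud rows and 95 non-fraud rows.
toy = pd.DataFrame({"misstate": [1] * 5 + [0] * 95, "x": range(100)})
balanced = oversample_minority(toy)
print(balanced["misstate"].value_counts())
print(f"Fraud share after oversampling: {balanced['misstate'].mean():.1%}")
```

With 95 majority rows, the target is ceil(95 / 9) = 11 minority rows, so the result has 106 rows and a fraud share of 11/106, slightly above 10%. Note that the minority rows are drawn with replacement, so some original fraud rows appear several times and others may not appear at all.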

6. Missing values. You may notice that some variables have missing values. <br>
Ideally, we need to handle missing values carefully. We will explore that in the future if we have the chance.<br>
For now, we simply drop the observations with missing values. Use dropna() to do that.
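As a quick illustration of what dropna() does, here is a minimal sketch on made-up data. The column names `ratio_a` and `ratio_b` are placeholders, not variables from the assignment dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each ratio column.
toy = pd.DataFrame({
    "misstate": [0, 1, 0, 1],
    "ratio_a":  [0.5, np.nan, 0.2, 0.9],
    "ratio_b":  [1.0, 2.0, np.nan, 4.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows with no missing value in any column.
complete = toy.dropna().reset_index(drop=True)
print(complete)
```

By default dropna() removes a row if any column is missing, so it can remove many rows when the data has many columns. It is worth rechecking `value_counts()` on 'misstate' afterwards, since dropping rows can shift the fraud share away from the oversampled 10%.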

7. Now let's fit logistic regression models. First, we use only the 14 financial ratio variables as the independent variables. Their definitions are in the Word document.

Prepare the data.

8. Fit the model using statsmodels. Show the results. Which variables are not significant?