# Study Time Analysis

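This notebook combines self-tracked study time with course grades to compute a credit-weighted GPA and to test whether grades are related to the time spent on a course relative to its expected workload.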

## 1. Import packages

In [13]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

## 2. Import data

In [14]:
time_tracking_df = pd.read_excel('./TimeTracking.xlsx')
grades_df = pd.read_excel('./Snittbetyg - Beräkning.xlsx')

## 3. GPA Calculation

In [15]:
# Weight each grade (Betyg) by the course's credits (HP, högskolepoäng)
grades_df["Score"] = grades_df["Betyg"] * grades_df["HP"]

In [16]:
tot_hp = grades_df["HP"].sum()
tot_score = grades_df["Score"].sum()
GPA = tot_score / tot_hp

In [17]:
# GPA was computed in the previous cell; report the totals
print(f"Tot HP graded: {tot_hp}")
print(f"Tot score graded: {tot_score}")
print(f"GPA: {GPA}")

Tot HP graded: 181.0
Tot score graded: 561.0
GPA: 3.0994475138121547
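The GPA above is a credit-weighted mean of the grades: the sum of Betyg × HP divided by the sum of HP. One thing worth checking is that pandas' `sum()` skips missing values. If `grades_df` lists courses without a numeric Betyg (the course table in section 4 suggests pass/fail courses appear with their HP), their credits enter `tot_hp` but not `tot_score`, which lowers the result. The sketch below shows the effect on made-up rows; the course names and values are hypothetical and only the column names mirror `grades_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical grade sheet; "Course D" is pass/fail and has no numeric grade
toy_grades = pd.DataFrame({
    "Kurs": ["Course A", "Course B", "Course C", "Course D"],
    "Betyg": [3.0, 4.0, 5.0, np.nan],
    "HP": [7.5, 7.5, 15.0, 3.0],
})

# Same steps as the cells above: the NaN score is skipped, the 3 HP are not
score = toy_grades["Betyg"] * toy_grades["HP"]
naive_gpa = score.sum() / toy_grades["HP"].sum()

# Restrict to courses with a numeric grade before weighting
graded = toy_grades.dropna(subset=["Betyg"])
graded_gpa = np.average(graded["Betyg"], weights=graded["HP"])

print(f"All rows in the denominator: {naive_gpa:.3f}")
print(f"Graded courses only:         {graded_gpa:.3f}")
```

Whether this applies to the real grade sheet depends on how ungraded courses are stored in it, so it is a check to run rather than a confirmed correction.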


## 4. TimeTracking Data Overview

In [18]:
courses = list(time_tracking_df["Kurs"].unique())  # also includes the non-course categories "Planering" (planning) and "Annat" (other)
print("Number of courses: " + str(len(courses)))
# print("\n".join(courses))

Number of courses: 29


In [19]:
# Sum the tracked minutes per course and convert to hours ("Timmar").
# groupby returns the courses sorted by name, each paired with its own total.
courses_df = (
    time_tracking_df.groupby("Kurs", as_index=False)["Total Tid (minuter)"]
    .sum()
    .rename(columns={"Total Tid (minuter)": "Timmar"})
)
courses_df["Timmar"] = (courses_df["Timmar"] / 60).round(2)

# Left join so that tracked categories without a row in grades_df are kept
courses_df = pd.merge(courses_df, grades_df, on="Kurs", how="left")

# Mark courses without a numeric grade (e.g. pass/fail G/IG) with -1
courses_df["Betyg"] = courses_df["Betyg"].fillna(-1)

# "Score" was only needed for the GPA calculation
courses_df = courses_df.drop(columns="Score")

# Expected hours: 1.5 HP corresponds to one 40-hour week of full-time study
courses_df["Förväntade timmar"] = (courses_df["HP"] / 1.5 * 40).round(2)
courses_df["Tid per förväntat (%)"] = (
    courses_df["Timmar"] / courses_df["Förväntade timmar"] * 100
).round(2)

courses_df.sort_values(by="Tid per förväntat (%)")

|  | Kurs | Timmar | Termin | Betyg | HP | Förväntade timmar | Tid per förväntat (%) |
|---|---|---|---|---|---|---|---|
| 3 | Datorer och Datoranvändning | 4.67 | HT20 | -1.0 | 3.0 | 80.0 | 5.84 |
| 16 | Kandidatarbete | 59.5 | HT22 | -1.0 | 15.0 | 400.0 | 14.88 |
| 27 | Tillämpad maskininlärning | 33.58 | HT22 | 3.0 | 7.5 | 200.0 | 16.79 |
| 28 | Utvärdering av Programvarusystem | 36.25 | VT21 | -1.0 | 7.0 | 186.67 | 19.42 |
| 0 | Affärsdriven programvaruutveckling | 47.25 | HT22 | 4.0 | 7.5 | 200.0 | 23.62 |
| 12 | Flertrådad programmering | 49.58 | HT22 | 5.0 | 7.5 | 200.0 | 24.79 |
| 20 | Matematisk statistik | 50.0 | HT22 | 5.0 | 7.5 | 200.0 | 25.0 |
| 15 | Introduktion till artificiella neuronnätverk o... | 50.58 | HT22 | 4.0 | 7.5 | 200.0 | 25.29 |
| 8 | Elektronik | 37.67 | VT22 | 3.0 | 5.0 | 133.33 | 28.25 |
| 5 | Datorteknik | 45.25 | VT22 | 4.0 | 6.0 | 160.0 | 28.28 |
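After the left join, a Betyg of -1 can stand for two different situations: a course that is listed in `grades_df` but has no numeric grade, or a tracked category that has no row in `grades_df` at all. A minimal sketch of how the `indicator` option of `pd.merge` can separate the two, using made-up rows (all names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical hours per tracked category
hours = pd.DataFrame({
    "Kurs": ["Course A", "Course B", "Planering"],
    "Timmar": [50.0, 4.5, 12.0],
})

# Hypothetical grade sheet; Course B is pass/fail, so it has no numeric grade
grades = pd.DataFrame({
    "Kurs": ["Course A", "Course B"],
    "Betyg": [4.0, None],
    "HP": [7.5, 3.0],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
merged = pd.merge(hours, grades, on="Kurs", how="left", indicator=True)

# Categories that never appear in the grade sheet
unmatched = merged.loc[merged["_merge"] == "left_only", "Kurs"].tolist()

# Courses that are in the grade sheet but lack a numeric grade
ungraded = merged.loc[
    (merged["_merge"] == "both") & merged["Betyg"].isna(), "Kurs"
].tolist()

print("Not in grade sheet:", unmatched)
print("No numeric grade:  ", ungraded)
```

Running this kind of check before `fillna(-1)` makes it easier to see which rows will later be excluded from the regression and why.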


In [20]:
tot_hours = courses_df["Timmar"].sum()
tot_expected = courses_df["Förväntade timmar"].sum()
tot_of_expected = tot_hours / tot_expected
print(f"Share of expected time studied: {round(tot_of_expected * 100, 2)} %")
print(f"Hours studied per 40-hour study week: {round(tot_of_expected * 40, 2)} hours")
print(f"Time studied: {round(tot_hours, 2)} hours")
print(f"Expected time: {round(tot_expected, 2)} hours")

Share of expected time studied: 39.21 %
Hours studied per 40-hour study week: 15.68 hours
Time studied: 1892.6 hours
Expected time: 4826.65 hours
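The expected workload follows from the rule used in the cell above: 1.5 HP per 40-hour week, so a 7.5 HP course is expected to take 7.5 / 1.5 × 40 = 200 hours. Because `sum()` skips missing values, any tracked category without HP (possibly "Planering" and "Annat", if they have no row in the grade sheet) adds to the studied hours but not to the expected hours. The sketch below, on made-up rows, compares the overall share with one restricted to rows that have an expected workload. The constant names and the helper `expected_hours` are illustrative and not part of the notebook.

```python
import pandas as pd

HP_PER_WEEK = 1.5     # credits per week of full-time study, as in the formula above
HOURS_PER_WEEK = 40   # nominal full-time study week

def expected_hours(hp):
    """Expected workload in hours for `hp` credits (scalar or Series)."""
    return hp / HP_PER_WEEK * HOURS_PER_WEEK

# Hypothetical rows; "Planering" has no credits and therefore no expected hours
toy = pd.DataFrame({
    "Kurs": ["Course A", "Course B", "Planering"],
    "Timmar": [50.0, 40.0, 10.0],
    "HP": [7.5, 6.0, None],
})
toy["Förväntade timmar"] = expected_hours(toy["HP"])

# All tracked hours over all expected hours (Planering only in the numerator)
share_all = toy["Timmar"].sum() / toy["Förväntade timmar"].sum()

# Only rows that have an expected workload
with_hp = toy.dropna(subset=["HP"])
share_courses = with_hp["Timmar"].sum() / with_hp["Förväntade timmar"].sum()

print(f"All tracked rows:  {share_all * 100:.2f} %")
print(f"Rows with HP only: {share_courses * 100:.2f} %")
```

Which of the two shares is the more meaningful figure depends on whether planning time is regarded as part of the coursework.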


## 5. Test Statistical Significance Between Grades & Relative Time Spent

### 5.1 All courses that are not G/IG

#### 5.11 Evaluation

In [21]:
cleaned_df = courses_df[courses_df["Betyg"] != -1]

In [23]:
# Fit a linear regression model to the data
x = cleaned_df["Tid per förväntat (%)"]
y = cleaned_df["Betyg"]
x_new = sm.add_constant(x) # Add a constant to the independent variable
model = sm.OLS(y, x_new).fit()

# Print the summary of the model to see the results
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Betyg   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                 -0.034
Method:                 Least Squares   F-statistic:                    0.2844
Date:                Wed, 08 Feb 2023   Prob (F-statistic):              0.599
Time:                        08:08:07   Log-Likelihood:                -26.772
No. Observations:                  23   AIC:                             57.54
Df Residuals:                      21   BIC:                             59.81
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     4.25

#### 5.12 Conclusion: No statistically significant linear trend between grades and relative time spent

The p-value for the slope on Tid per förväntat (%) tests the null hypothesis that there is no linear relationship between Tid per förväntat (%) and Betyg. Because the model has a single predictor, this p-value equals Prob (F-statistic) in the summary, 0.599. A p-value below the chosen significance level (conventionally 0.05) would indicate a statistically significant relationship.

Here 0.599 is well above 0.05, so the null hypothesis cannot be rejected: across the 23 graded courses, the relationship between relative time spent and grade is not statistically significant at the 0.05 level. The R-squared of 0.013 points the same way, since relative time spent accounts for about 1% of the variance in grades. Failing to reject the null hypothesis does not prove that no relationship exists; with this few observations, a weak effect could go undetected.

### 5.2 All courses minus G/IG courses and Programming courses

#### 5.21 Evaluation

In [28]:
programming_courses = [
    "Programmering Grundkurs",
    "Objektorienterad Modellering och Design",
    "Programmeringsteknik Fördjupningskurs",
    "Grundläggande funktionsprogrammering",
    "Flertrådad programmering",
]
cleaned_df = cleaned_df[~cleaned_df["Kurs"].isin(programming_courses)]

In [29]:
# Fit a linear regression model to the data
x = cleaned_df["Tid per förväntat (%)"]
y = cleaned_df["Betyg"]
x_new = sm.add_constant(x) # Add a constant to the independent variable
model = sm.OLS(y, x_new).fit()

# Print the summary of the model to see the results
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Betyg   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                 -0.016
Method:                 Least Squares   F-statistic:                    0.7348
Date:                Wed, 08 Feb 2023   Prob (F-statistic):              0.404
Time:                        09:02:05   Log-Likelihood:                -19.009
No. Observations:                  18   AIC:                             42.02
Df Residuals:                      16   BIC:                             43.80
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     4.17



#### 5.22 Conclusion: No statistically significant linear trend between grades and relative time spent

With the five programming courses excluded, 18 courses remain. As in section 5.1, the p-value for the slope on Tid per förväntat (%) tests the null hypothesis of no linear relationship with Betyg, and with a single predictor it equals Prob (F-statistic), here 0.404.

Since 0.404 is greater than 0.05, the null hypothesis cannot be rejected, and the relationship is not statistically significant at the 0.05 level. R-squared rises only slightly, from 0.013 to 0.044, so excluding the programming courses does not change the conclusion.