### 1. Data Loading and Preprocessing:
Load stock_n_hl_news.csv into DataFrame df.

Convert 'Date' column to datetime format.

In [1]:
import pandas as pd

#load data
df = pd.read_csv('stock_n_hl_news.csv')
print("Shape:", df.shape)
print(df.head())

#convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
print(df.dtypes['Date'])
print(df.head())

Shape: (483, 7)
         Date       AAPL       NASDAQ          NYA        SP500          DJI  \
0  2017-12-18  41.653164  6994.759766  12785.79980  2690.159912  24792.19922   
1  2017-12-19  41.209293  6963.850098  12747.50000  2681.469971  24754.69922   
2  2017-12-20  41.164425  6960.959961  12747.59961  2679.250000  24726.69922   
3  2017-12-21  41.320255  6965.359863  12800.20020  2684.570068  24782.30078   
4  2017-12-22  41.320255  6959.959961  12797.40039  2683.340088  24754.09961   

                                            Headline  
0  France saves Marquis de Sade’s 120 Days of Sod...  
1  House prices to fall in London and south-east ...  
2  Hedge funds fail to stop 'billion-dollar brain...  
3  Guardian Brexit watch  \n\n\n  Brexit helped p...  
4  Jim Cramer broke down why owning fewer stocks ...  
datetime64[ns]
        Date       AAPL       NASDAQ          NYA        SP500          DJI  \
0 2017-12-18  41.653164  6994.759766  12785.79980  2690.159912  24792.19922   
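
As a side note, the date conversion can also happen at load time. The sketch below is an assumption-laden illustration: it uses a small in-memory CSV with made-up rows (not the contents of stock_n_hl_news.csv) and the `parse_dates` argument of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "Date,NASDAQ\n2017-12-18,6994.76\n2017-12-19,6963.85\n"

# parse_dates converts the listed columns to datetime while reading
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(toy.dtypes["Date"])
```

Either approach should leave 'Date' as a datetime column; calling `pd.to_datetime` afterwards, as in the cell above, keeps the conversion step explicit.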


### 2. Market Trend Column:

In df, create a column 'Trend'.

Convert the feedback scores into a 2-dimensional numpy array (iOS scores as the first
dimension, Android scores as the second).

Label 'Trend' as 'Bullish' if NASDAQ index is higher than the previous day, 'Bearish'
otherwise.

Assume 'Bullish' for the first date in the dataset.

In [2]:
import numpy as np

#generate trend
df['Trend'] = np.where(df['NASDAQ'].diff() > 0, 'Bullish', 'Bearish')
#set the first available day to Bullish
df.loc[0, 'Trend'] = 'Bullish'

#verify
print(df[['Date','NASDAQ','Trend']].head())
print(df['Trend'].value_counts())

        Date       NASDAQ    Trend
0 2017-12-18  6994.759766  Bullish
1 2017-12-19  6963.850098  Bearish
2 2017-12-20  6960.959961  Bearish
3 2017-12-21  6965.359863  Bullish
4 2017-12-22  6959.959961  Bearish
Trend
Bullish    275
Bearish    208
Name: count, dtype: int64
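
To see why the first row needs an explicit override, here is a minimal sketch on made-up index values (the `toy` frame is hypothetical, not taken from the dataset). `diff()` returns NaN for the first row, and `NaN > 0` is False, so the row would otherwise be labelled 'Bearish'. A day with no change is also labelled 'Bearish' under this rule.

```python
import numpy as np
import pandas as pd

# Hypothetical index values: down, flat, then up
toy = pd.DataFrame({"NASDAQ": [100.0, 99.0, 99.0, 101.0]})

# Label each day by comparing it with the previous day
toy["Trend"] = np.where(toy["NASDAQ"].diff() > 0, "Bullish", "Bearish")

# The first row has no previous day, so it is set by assumption
toy.loc[toy.index[0], "Trend"] = "Bullish"

print(toy)
```

Using `toy.index[0]` instead of the literal label `0` is a small design choice that keeps the override working if the DataFrame does not have the default integer index.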


### Interpretation: Feedback Analysis

The t-test compares feedback scores between iOS and Android users. 

- Test Statistic: 1.9033888211703986
- P-value: 0.05756609386358278

Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This means we **do not have sufficient evidence** to say that average customer satisfaction is significantly different between iOS and Android apps.
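
The cell that produced these numbers is not shown here, so the following is only a sketch of how such a comparison could be run. It assumes the scores sit in a 2-dimensional NumPy array with iOS in the first row and Android in the second, uses made-up scores, and picks Welch's t-test (`equal_var=False`), which may differ from the test actually used.

```python
import numpy as np
from scipy import stats

# Hypothetical feedback scores: row 0 = iOS, row 1 = Android
scores = np.array([
    [4.1, 3.8, 4.5, 4.0, 3.9, 4.3],
    [3.7, 4.0, 3.6, 3.9, 4.2, 3.5],
])
ios_scores, android_scores = scores[0], scores[1]

# Welch's t-test does not assume equal variances in the two groups
statistic, p_val = stats.ttest_ind(ios_scores, android_scores, equal_var=False)

# Decision rule at the 5% significance level
alpha = 0.05
decision = "reject H0" if p_val < alpha else "fail to reject H0"
print(statistic, p_val, decision)
```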

### 3. Sales Performance Analysis (sales_analysis function): 
Compare sales before and after a major marketing campaign (March 1-31, 2023).

Use an appropriate T-Test to assess the campaign's impact on sales.

Return the t-test statistic and the p-value and print the returned values.

Interpret the result, i.e., whether the p-value indicates a significant impact.

In [8]:
from scipy import stats

#function: Sales Analysis
def sales_analysis(df_sales):
    #split data before and after March 2023 campaign
    before = df_sales[df_sales['date'] < '2023-03-01']['sales'].values
    after = df_sales[df_sales['date'] > '2023-03-31']['sales'].values

    #perform independent t-test
    statistic, p_val = stats.ttest_ind(before, after, equal_var=False)

    print("Sales analysis t-test statistic:", statistic)
    print("Sales analysis pvalue:", p_val)

    return statistic, p_val

#run sales analysis
sales_stat, sales_p = sales_analysis(sales_df)

Sales analysis t-test statistic: 0.16642710322927962
Sales analysis pvalue: 0.8679489234386756


### Interpretation: Sales Analysis

The t-test compares sales figures before and after the marketing campaign in March 2023.

- Test Statistic: 0.16642710322927962
- P-value: 0.8679489234386756

Since the p-value is well above 0.05, we fail to reject the null hypothesis. This means we **do not have sufficient evidence** that the marketing campaign had a measurable impact on sales.
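
For readers who want to see what the test statistic measures, the sketch below recomputes Welch's t by hand and compares it with `stats.ttest_ind(..., equal_var=False)`. The `before` and `after` arrays are made-up values, not the notebook's sales data.

```python
import numpy as np
from scipy import stats

# Hypothetical daily sales before and after a campaign
before = np.array([200.0, 220.0, 210.0, 190.0, 205.0])
after = np.array([215.0, 225.0, 198.0, 230.0, 207.0, 212.0])

# Welch's t: difference in means divided by the unpooled standard error,
# using sample variances (ddof=1)
std_err = np.sqrt(before.var(ddof=1) / before.size + after.var(ddof=1) / after.size)
t_manual = (before.mean() - after.mean()) / std_err

# The same statistic from SciPy
t_scipy, p_val = stats.ttest_ind(before, after, equal_var=False)

print(t_manual, t_scipy, p_val)
```

A statistic near zero, like the one reported above, means the difference in means is small relative to the spread of the data.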

### 4. Seasonal Sales Analysis (seasonal_analysis function):
Examine sales differences between summer (June-August) and winter
(December-February).

Apply a T-Test to assess if these variations are statistically significant.

Return the t-test statistic and the p-value and print the returned values.

Interpret the result, i.e., whether significant seasonal variation exists based on the p-value.

In [4]:
#function: Seasonal Sales Analysis
def seasonal_analysis(df_sales):
    #define summer and winter months
    summer_months = [6, 7, 8]
    winter_months = [12, 1, 2]

    summer_sales = df_sales[df_sales['date'].dt.month.isin(summer_months)]['sales'].values
    winter_sales = df_sales[df_sales['date'].dt.month.isin(winter_months)]['sales'].values

    #perform independent t-test
    statistic, p_val = stats.ttest_ind(summer_sales, winter_sales, equal_var=False)

    print("Seasonal analysis t-test statistic:", statistic)
    print("Seasonal analysis pvalue:", p_val)

    return statistic, p_val

#run seasonal analysis
seasonal_stat, seasonal_p = seasonal_analysis(sales_df)



Seasonal analysis t-test statistic: 0.09956961638905915
Seasonal analysis pvalue: 0.9207644588060664


### Interpretation: Seasonal Sales Analysis

This t-test compares sales between the summer months (June-August) and winter months (December-February).

- Test Statistic: 0.09956961638905915
- P-value: 0.9207644588060664

Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This means we **do not have sufficient evidence** of seasonal variation in sales between summer and winter.
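
One detail of the month-based selection is worth illustrating: filtering on `dt.month` pools the same calendar months from every year in the data. The sketch below uses a hypothetical `sales_toy` frame with made-up dates to show this.

```python
import pandas as pd

# Hypothetical sales rows spanning two Decembers
sales_toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-12-15", "2023-01-10", "2023-03-05", "2023-06-01", "2023-12-20"]
    ),
    "sales": [10, 12, 15, 20, 11],
})

# Month numbers ignore the year, so both Decembers land in 'winter'
winter = sales_toy[sales_toy["date"].dt.month.isin([12, 1, 2])]
summer = sales_toy[sales_toy["date"].dt.month.isin([6, 7, 8])]

print(len(winter), len(summer))
```

If the data covers more than one year, this pooling is an assumption to keep in mind when reading the seasonal comparison.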

### 5. Feedback Consistency Analysis (consistency_analysis function):
Assess if monthly feedback scores are consistent across January, May, September, and
December.

Use one-way ANOVA to test for significant differences in feedback scores across these
months.

Return the statistic and the p-value and print the returned values.

Interpret the result, i.e., whether the differences are significant based on the p-value.


In [5]:
#function Feedback Consistency Analysis
def consistency_analysis(df_feedback):
    #filter months of interest (assign returns a copy, so the caller's DataFrame is not modified)
    df_feedback = df_feedback.assign(month=df_feedback['date'].dt.month)
    selected_months = {1: 'Jan', 5: 'May', 9: 'Sep', 12: 'Dec'}
    filtered = df_feedback[df_feedback['month'].isin(selected_months.keys())]

    # Group feedback scores by month
    jan = filtered[filtered['month'] == 1]['feedback_score'].values
    may = filtered[filtered['month'] == 5]['feedback_score'].values
    sep = filtered[filtered['month'] == 9]['feedback_score'].values
    dec = filtered[filtered['month'] == 12]['feedback_score'].values

    #perform one-way ANOVA
    statistic, p_val = stats.f_oneway(jan, may, sep, dec)

    print("Feedback consistency ANOVA statistic:", statistic)
    print("Feedback consistency pvalue:", p_val)

    return statistic, p_val

#run consistency analysis
consistency_stat, consistency_p = consistency_analysis(feedback_df)



Feedback consistency ANOVA statistic: 0.3146823675455494
Feedback consistency pvalue: 0.8147473590881886


### Interpretation: Feedback Consistency Analysis

The one-way ANOVA compares customer feedback scores across the months of January, May, September, and December.

- ANOVA Statistic: 0.3146823675455494
- P-value: 0.8147473590881886

Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This means we **do not have sufficient evidence** of a difference in average feedback scores across these months.
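
To make the ANOVA statistic less of a black box, the sketch below computes the F statistic by hand (between-group variance over within-group variance) and checks it against `stats.f_oneway`. The four groups are made-up scores, not the notebook's feedback data.

```python
import numpy as np
from scipy import stats

# Hypothetical feedback scores for four months
groups = [
    np.array([4.0, 3.5, 4.2, 3.9]),  # Jan
    np.array([3.8, 4.1, 3.6, 4.0]),  # May
    np.array([4.3, 3.7, 3.9, 4.1]),  # Sep
    np.array([3.6, 4.0, 4.2, 3.8]),  # Dec
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Sum of squares between groups and within groups
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between, N - k within
df_between = len(groups) - 1
df_within = all_scores.size - len(groups)

f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_val = stats.f_oneway(*groups)

print(f_manual, f_scipy, p_val)
```

An F statistic well below 1, as reported above, indicates that the month means vary less than would be expected from the scatter within each month.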

### 6. Sales and Feedback Correlation Analysis (corr_analysis function):
Investigate if high customer feedback correlates with increased sales.

Merge feedback and sales data, splitting sales into groups with high and low feedback scores.

Perform a T-Test to compare sales in months with high vs. low feedback scores.

Return the statistic and the p-value and print the returned values.

Interpret the result, i.e., if correlation is significant based on the p-value.

In [None]:
#function: Sales and Feedback Correlation Analysis
def corr_analysis(df_feedback, df_sales):
    #aggregate feedback scores by date
    avg_feedback = df_feedback.groupby('date')['feedback_score'].mean().reset_index()
    avg_feedback.columns = ['date', 'avg_feedback_score']

    #aggregate sales by date
    total_sales = df_sales.groupby('date')['sales'].sum().reset_index()

    #merge both datasets on date
    merged = pd.merge(avg_feedback, total_sales, on='date')

    #label high vs. low feedback
    threshold = merged['avg_feedback_score'].median()
    high_feedback = merged[merged['avg_feedback_score'] > threshold]['sales']
    low_feedback = merged[merged['avg_feedback_score'] <= threshold]['sales']

    #perform t-test
    statistic, p_val = stats.ttest_ind(high_feedback, low_feedback, equal_var=False)

    print("Correlation analysis t-test statistic:", statistic)
    print("Correlation analysis pvalue:", p_val)

    return statistic, p_val

#run correlation analysis
corr_stat, corr_p = corr_analysis(feedback_df, sales_df)


Correlation analysis t-test statistic: 0.5288478099451596
Correlation analysis pvalue: 0.5978430534511507


### Interpretation: Sales and Feedback Correlation Analysis

This t-test evaluates whether higher average feedback scores are associated with significantly different sales performance.

- Test Statistic: 0.5288478099451596
- P-value: 0.5978430534511507

Since the p-value is greater than 0.05, we fail to reject the null hypothesis. We **do not have sufficient evidence** that mean sales differ between dates with high and low average feedback scores, so the data do not show that higher feedback scores go with higher sales.
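
The median split above turns feedback into two groups before testing. As a complementary check, not part of the original task, one could measure the linear association directly with `scipy.stats.pearsonr`. The sketch below assumes a merged frame with the same column names as in `corr_analysis`, filled with made-up values.

```python
import pandas as pd
from scipy import stats

# Hypothetical merged data: one row per date
merged = pd.DataFrame({
    "avg_feedback_score": [3.2, 4.1, 3.8, 4.5, 2.9, 3.6],
    "sales": [120.0, 135.0, 128.0, 150.0, 110.0, 131.0],
})

# Pearson's r and the p-value for the null hypothesis of no linear correlation
r, p_val = stats.pearsonr(merged["avg_feedback_score"], merged["sales"])

print(r, p_val)
```

Unlike the median split, this uses every feedback value as is, so it does not depend on where the high/low threshold falls.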