## A7 Final Project

### Introduction

This is my final project for HCDE 410, Human Data Interaction, taken in Spring 2022. The project uses Python code and Jupyter notebooks to look into the effects popular alien movies have on UFO sightings. It aims to be understandable, reproducible, and transparent in its findings and processes.

With this analysis I plan to determine whether UFO sightings correlate with the release of alien movies, and whether the timing of the release or the country of origin (i.e. released in the US) has an impact on the reports. I chose 3 popular alien movies from 3 different decades, spaced apart in time so as to limit the influence they have on each other.

I am curious about the impact media has on culture, and about how the subconscious may be influenced by what we think we see. This is interesting and useful from a practical, human-centered perspective because it will either uncover a possible influence of technology or show no correlation at all, perhaps pushing us to consider the extraterrestrial more seriously. I hope to learn more about what factors may influence UFO sighting reports and to see whether there is a trend in where, when, or what is seen. Film and media often influence us, and I am curious whether they influence our subconscious and imagination so that we see things we hope to find.


### Background and Related Work
Media influences our behavior in ways we are sometimes unconscious of, and attempts can be made to tap into these unconscious channels. If this influence exists, there could be many motivations to tap into it, whether political, monetary, or personal.

One psychology paper that digs into this idea, "A QUANTITATIVE, NONREACTIVE STUDY OF MASS BEHAVIOR WITH EMPHASIS ON THE CINEMA AS BEHAVIORAL CATALYST" (1981) (https://journals.sagepub.com/doi/pdf/10.2466/pr0.1981.48.3.775), cites the example of the movie Jaws, released in 1975, whose plot revolves around the sudden appearance and attacks of a great white shark on swimmers at a beach resort and the attempts to kill the shark. The paper measured the effect of the film on mass behavior through media coverage of sharks: more newspaper articles, photographs, and cartoons on sharks/Jaws appeared in 1975 than over the same months in other years. This is particularly relevant to the issue of bias in the news media, as various scientists publicly pointed out that there had not been an increase in either shark sightings or attacks. This is key to the psychological effect that media reporting can have on the public.

Additionally, the sensitivity of a topic can affect how it is presented. UFOs have a long history of government involvement, which can make them seem even more novel to the public. This report (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.825.3162&rep=rep1&type=pdf) highlights the history of UFOs and the government, noting that although the extraterrestrial hypothesis “has not been categorically proven... strong presumptions exist in its favour.” The report then goes on to consider in detail the likely consequences of open extraterrestrial contact for politics, science and religion. This vague secrecy and novel interest may pull reporters in further.

### Research Questions and Hypotheses

#### Research Question
"How did the release of popular alien movies Close Encounters of the Third Kind (1977), E.T. the Extra-Terrestrial (1982), and Independence Day (1996) affect UFO sighting frequency?"

#### Hypothesis
After major alien movie releases, UFO sightings increase in frequency compared with the time periods before the movie was released.

### Data for Analysis

The data I plan to use is the "[UFO sightings](https://www.kaggle.com/datasets/NUFORC/ufo-sightings)" data set. I am interested in 3 movies: "[Close Encounters of the Third Kind](https://en.wikipedia.org/wiki/Close_Encounters_of_the_Third_Kind)" (1977), "[E.T. the Extra-Terrestrial](https://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial)" (1982), and "[Independence Day](https://en.wikipedia.org/wiki/Independence_Day_(1996_film))" (1996). Each includes alien-related themes and is a top-grossing, popular movie. I chose one movie from each of three decades so that their effects would not overlap.


#### [UFO Sighting Data Set](https://www.kaggle.com/datasets/NUFORC/ufo-sightings)

This dataset contains over 80,000 reports of UFO sightings over the last century. There are two versions of this dataset: scrubbed and complete. The complete data includes entries where the location of the sighting was not found or blank (0.8146%) or where the time is erroneous or blank (8.0237%). The scrubbed data set removes these entries, and it is the one I will use for my analysis. Since the reports date back to the 20th century, some older data might be obscured. The data contains the city, state, time, description, and duration of each sighting.

The license is Unknown (as listed on the data set website).

This data set is suitable because it has the location and year of each UFO sighting, allowing me to compare against movie release years and location (USA).

One possible ethical consideration is that the data set includes exact latitude and longitude points and descriptions with varying levels of detail about the location or the person reporting. Reducing the chance of data being traced back to a person at a particular time and place is something to consider when highlighting data points. My peers raised this as well: be wary of which data I highlight, and make sure the comments protect people by removing names or exact personal information.

### Methodology

#### Procedure
I intend to analyze my data as follows:
1. For each of my three movies, I will find the number of UFO sightings in the year before the release and the year after the release, as well as the year of release for completeness. The range of 1 year is used to provide a narrow window in which to see the effect.
2. I will filter the UFO data to US reports, as all the films were released and based in the US.
3. I will compare the counts of reported sightings before and after the release date and run any relevant statistical tests.
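
The steps above can be sketched as a small helper. This is only a minimal sketch on made-up toy data, assuming the `country` and `year` columns that the Data Processing section creates. The helper name `counts_around_release` is hypothetical and is not part of the notebook.

```python
import pandas as pd

def counts_around_release(sightings, release_year, country="us"):
    """Count sightings in the year before, the year of, and the year after a release."""
    # Step 2: keep only reports from the country of interest
    in_country = sightings[sightings["country"] == country]
    # Step 1: count reports in each of the three calendar years
    return {
        label: int((in_country["year"] == release_year + offset).sum())
        for label, offset in (("before", -1), ("of", 0), ("after", 1))
    }

# Toy rows standing in for the cleaned UFO reports (not real counts)
toy = pd.DataFrame({
    "country": ["us", "us", "gb", "us", "us", "us"],
    "year":    [1976, 1977, 1977, 1978, 1978, 1990],
})

# Step 3 would compare the "before" and "after" values
print(counts_around_release(toy, 1977))
```

Keeping the country filter and the year counts in one function means the same logic is applied to every movie.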

#### Statistical Analysis

I will compare the values using a bar chart to depict the counts of sightings for all of the movies for easy readability. I will then use a t-test for 2 independent means, with treatment 1 as the pre-release sightings and treatment 2 as the post-release sightings. I plan to use [this calculator](https://www.socscistatistics.com/tests/studentttest/default2.aspx) to complete my calculations for this step.

This is a two-tailed test because I am not assuming the direction of the effect, and I will use a significance level of 0.05 to determine whether the result is statistically significant.

#### Data Processing

To get the data into the notebook, we first need to import pandas.

In [7]:
# import pandas, a code toolkit for working with spreadsheet-style data files
import pandas as pd

Once pandas is loaded, we can read the CSV. Let's see what data points we have overall, looking at the RangeIndex and memory usage so we can see how they change when we clean up the data and remove the missing values.

In [8]:
# load in scrubbed data
df = pd.read_csv('scrubbed.csv', low_memory=False)

# Check info to see what we are working with
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              80332 non-null  object 
 1   city                  80332 non-null  object 
 2   state                 74535 non-null  object 
 3   country               70662 non-null  object 
 4   shape                 78400 non-null  object 
 5   duration (seconds)    80332 non-null  object 
 6   duration (hours/min)  80332 non-null  object 
 7   comments              80317 non-null  object 
 8   date posted           80332 non-null  object 
 9   latitude              80332 non-null  object 
 10  longitude             80332 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.7+ MB


Let's also print out the first five rows to see what type of data we are working with.

In [10]:
#view data
df.head()

|   | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx |  | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) |  | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 |


Looks like there are two date columns: the date of the sighting (datetime) and the date posted. We are interested in datetime because that is when the sighting occurred, so to be concise let's remove the date posted.

In [11]:
# drop columns that will not be relevant
df.drop(['date posted'], axis=1, inplace=True)

# view data, to see how it has changed
df.head()

|   | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | latitude | longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx |  | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) |  | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 21.4180556 | -157.803611 |


Great! No more date posted. Now let's make sure that all the data we are working with has values and that there are no empty fields in the columns we rely on. Empty fields could cause incorrect counts of sightings in certain years.

In [12]:
# Drop any rows that are missing a value in the columns we rely on
df.dropna(how='any', axis=0, subset=['country', 'datetime', 'latitude', 'shape'], inplace=True)

# Reset index
df.reset_index(drop=True, inplace=True)
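
Before dropping rows, it can help to see how many values are missing in each column, so the size of the drop is not a surprise. The sketch below uses a small made-up frame rather than the real file. The column names mirror the data set, but the values and the `required` list are illustrative assumptions.

```python
import pandas as pd

# Toy frame with gaps (None becomes a missing value)
toy = pd.DataFrame({
    "country": ["us", None, "gb", "us"],
    "shape":   ["light", "disk", None, "circle"],
    "state":   ["tx", "tx", None, None],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with a gap in the required columns only; gaps in 'state' are kept
required = ["country", "shape"]
cleaned = toy.dropna(subset=required).reset_index(drop=True)
print(len(toy) - len(cleaned), "rows dropped")
```

Because `subset` only names some columns, a row can still have an empty `state` after cleaning, which matches the `state` counts in the `df.info()` output.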

Now that we have removed the rows that have empty fields, let's check how that affected our RangeIndex and memory usage.

In [13]:
# Check info to see how much of a difference that made
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69001 entries, 0 to 69000
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              69001 non-null  object 
 1   city                  69001 non-null  object 
 2   state                 66524 non-null  object 
 3   country               69001 non-null  object 
 4   shape                 69001 non-null  object 
 5   duration (seconds)    69001 non-null  object 
 6   duration (hours/min)  69001 non-null  object 
 7   comments              68993 non-null  object 
 8   latitude              69001 non-null  object 
 9   longitude             69001 non-null  float64
dtypes: float64(1), object(9)
memory usage: 5.3+ MB


Looks like the memory went down by more than 1 MB, from 6.7 MB to 5.3 MB! And our RangeIndex went down from 80332 to 69001 entries, so our removal was effective!

Now we are ready to start working with the data to better find years. The datetime column is in a pretty cluttered format right now, and we need clear access to the year, so let's separate out the values so we can filter more easily by year.

In [14]:
# Get rid of time, looks like there was some issue with the formatting, and not worth correcting
df['datetime'] = df['datetime'].str.split(expand=True)[0]

# Convert column to a datetime object
df['datetime'] = pd.to_datetime(df['datetime'])

# Create a new column, year, using just the year from the datetime
df['year'] = df['datetime'].dt.year

# Create new column, month, using just the month from the datetime
df['month'] = df['datetime'].dt.month
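
As a hedged alternative, the date format can be stated explicitly so that malformed entries turn into missing values instead of raising an error. This sketch assumes the dates are written month/day/year, which is inferred from the sample rows shown earlier. The toy strings, including the deliberately bad one, are made up for illustration.

```python
import pandas as pd

# Toy strings in the same shape as the datetime column, plus one bad entry
raw = pd.Series(["10/10/1949 20:30", "10/10/1955 17:00", "not a date"])

# Keep only the date part (the text before the first space)
date_part = raw.str.split().str[0]

# Parse with an explicit format; anything that does not match becomes NaT
parsed = pd.to_datetime(date_part, format="%m/%d/%Y", errors="coerce")

# Extract year and month; rows that failed to parse are missing here too
years = parsed.dt.year
months = parsed.dt.month
print(parsed.isna().sum(), "entries could not be parsed")
```

Counting the unparsed entries shows how many rows would be silently lost from the yearly counts.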

Now let's check to see if our data looks right!

In [17]:
# view data 
df.head()

|   | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | latitude | longitude | year | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1949-10-10 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 29.8830556 | -97.941111 | 1949 | 10 |
| 1 | 1955-10-10 | chester (uk/england) |  | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 53.2 | -2.916667 | 1955 | 10 |
| 2 | 1956-10-10 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 28.9783333 | -96.645833 | 1956 | 10 |
| 3 | 1960-10-10 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 21.4180556 | -157.803611 | 1960 | 10 |
| 4 | 1961-10-10 | bristol | tn | us | sphere | 300 | 5 minutes | My father is now 89 my brother 52 the girl wit... | 36.595 | -82.188889 | 1961 | 10 |


Looking at the values we have here, some are in Great Britain! Because we are assessing US movies, we only want to look at the sightings that occurred in the US. Let's look at the total count, just for reference of how much data there is over all the years, and also check that the non-US values were removed.

In [19]:
#country we are interested in the US
country_of_interest = ['us']

# filter to just the us
us_only  = df[df['country'].isin(country_of_interest)]

#how many sightings were there overall, just for fun, to know how much data we are working with
us_only.value_counts('country', ascending=False)

# view data, check to see if the Great Britain value is gone
us_only.head()

|   | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | latitude | longitude | year | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1949-10-10 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 29.8830556 | -97.941111 | 1949 | 10 |
| 2 | 1956-10-10 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 28.9783333 | -96.645833 | 1956 | 10 |
| 3 | 1960-10-10 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 21.4180556 | -157.803611 | 1960 | 10 |
| 4 | 1961-10-10 | bristol | tn | us | sphere | 300 | 5 minutes | My father is now 89 my brother 52 the girl wit... | 36.595 | -82.188889 | 1961 | 10 |
| 6 | 1965-10-10 | norwalk | ct | us | disk | 1200 | 20 minutes | A bright orange color changing to reddish colo... | 41.1175 | -73.408333 | 1965 | 10 |


Now it's time to look at the years of interest! We will use our data set, which has been cleaned of empty fields, has an easily accessible year, and contains only US values. We will start with the 1977 movie Close Encounters of the Third Kind.

In [21]:
# Close Encounters of the Third Kind (1977)

# count the US-only sightings in 1976, the year before release
yearbefore = (us_only['year'] == 1976).sum()

# count the US-only sightings in 1977, the year of release
yearof = (us_only['year'] == 1977).sum()

# count the US-only sightings in 1978, the year after release
yearafter = (us_only['year'] == 1978).sum()

# print the counts directly instead of slicing them out of a string
print("Close Encounters of the Third Kind")
print(f"There were {yearbefore} sightings in 1976")
print(f"There were {yearof} sightings in 1977")
print(f"There were {yearafter} sightings in 1978")

Close Encounters of the Third Kind
There were 217 sightings in 1976
There were 204 sightings in 1977
There were 255 sightings in 1978


We will repeat this process with the 1982 movie E.T. the Extra-Terrestrial.

In [22]:
# E.T. the Extra-Terrestrial (1982)

# count the US-only sightings in 1981, the year before release
yearbefore = (us_only['year'] == 1981).sum()

# count the US-only sightings in 1982, the year of release
yearof = (us_only['year'] == 1982).sum()

# count the US-only sightings in 1983, the year after release
yearafter = (us_only['year'] == 1983).sum()

# print the counts directly instead of slicing them out of a string
print("ET")
print(f"There were {yearbefore} sightings in 1981")
print(f"There were {yearof} sightings in 1982")
print(f"There were {yearafter} sightings in 1983")

ET
There were 132 sightings in 1981
There were 133 sightings in 1982
There were 121 sightings in 1983


And last we will repeat this process with the 1996 movie Independence Day.

In [24]:
# Independence Day (1996)

# count the US-only sightings in 1995, the year before release
yearbefore = (us_only['year'] == 1995).sum()

# count the US-only sightings in 1996, the year of release
yearof = (us_only['year'] == 1996).sum()

# count the US-only sightings in 1997, the year after release
yearafter = (us_only['year'] == 1997).sum()

# print the counts directly instead of slicing them out of a string
print("Independence Day")
print(f"There were {yearbefore} sightings in 1995")
print(f"There were {yearof} sightings in 1996")
print(f"There were {yearafter} sightings in 1997")

Independence Day
There were 413 sightings in 1995
There were 431 sightings in 1996
There were 976 sightings in 1997


### Findings

Using Excel, I graphed the counts for the year before, the year of, and the year after each movie's release. Here are the three graphs; note the different maximums on the y axis. These graphs can also be downloaded as PNGs in the notebook file.

![Screen Shot 2022-06-02 at 12.14.11 PM.png](attachment:ca5cfa44-0294-4a8e-81ec-a04e3d266252.png)

![Screen Shot 2022-06-02 at 12.14.55 PM.png](attachment:f1881921-17b7-4052-b52d-ac25dbb1cc79.png)

![Screen Shot 2022-06-02 at 5.41.36 PM.png](attachment:e103c085-6959-4745-9817-28409f226e87.png)

Last, here are all the graphs combined, to give a better sense of how they compare over the decades. The sharp overall increase in the 1990s is easier to see here.

![Screen Shot 2022-06-02 at 8.22.19 PM.png](attachment:11fa7e4d-99c1-4548-a7a4-37594194d75c.png)

Overall there are more sightings in the 1990s than the other two decades.

#### Statistical Tests

Using a t-test for 2 independent means with the calculator found [here](https://www.socscistatistics.com/tests/studentttest/default2.aspx), I ran the test for each pair of movies, comparing their pre- and post-release values, using a significance level of 0.05 and two-tailed conditions.

##### Close Encounters <> ET

The t-value is -0.17015. The p-value is .880549. The result is not significant at p < .05.


##### Close Encounters <> Independence Day

The t-value is -0.80437. The p-value is .505599. The result is not significant at p < .05.


##### ET <> Independence Day

The t-value is -0.61334. The p-value is .602113. The result is not significant at p < .05.

Overall, none of the comparisons showed a statistically significant difference.
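
The reported values are consistent with treating the two movies' pre-release counts as one group and their post-release counts as the other, so each group has only two values and the test has 2 degrees of freedom. The sketch below recomputes the statistics under that reading, assuming the calculator uses the standard pooled-variance t-test. The function names are illustrative, and the closed-form p-value used here is valid only for exactly 2 degrees of freedom.

```python
import math
import statistics

def pooled_t_test(group1, group2):
    """Return the pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Pooled sample variance across both groups
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    t = mean_diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def two_tailed_p_df2(t):
    """Two-tailed p-value for a t statistic with exactly 2 degrees of freedom."""
    return 1 - abs(t) / math.sqrt(2 + t * t)

# Counts reported above: (year before release, year after release)
counts = {
    "Close Encounters": (217, 255),
    "ET": (132, 121),
    "Independence Day": (413, 976),
}

def compare(movie_a, movie_b):
    """Test the pre-release counts of two movies against their post-release counts."""
    pre = [counts[movie_a][0], counts[movie_b][0]]
    post = [counts[movie_a][1], counts[movie_b][1]]
    t, dof = pooled_t_test(pre, post)
    assert dof == 2  # the closed-form p-value below only holds for this case
    return t, two_tailed_p_df2(t)

pairs = [("Close Encounters", "ET"),
         ("Close Encounters", "Independence Day"),
         ("ET", "Independence Day")]
for a, b in pairs:
    t, p = compare(a, b)
    print(f"{a} <> {b}: t = {t:.5f}, p = {p:.6f}")
```

With only two values per group, the test has very little power, which is worth keeping in mind when reading the non-significant results.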


### Discussion 

Overall, the differences were not statistically significant in any of the comparisons. However, 2 of the 3 movies (Close Encounters and Independence Day) did show an increase in the year after release. Although this trend is not statistically significant, its existence may prompt more research in this area. There was also an interesting trend across the decades. I chose 3 movies that were far apart in time to hopefully mitigate overlap in influence. Looking at the numbers generally, the 1977 movie had report counts in the 200s, the 1982 movie had counts in the 120s and 130s, and the 1996 movie had counts starting in the 400s, higher than the other two, with the year after release reaching the 900s. I think this insight also prompts more research into national events, or perhaps other alien movie releases, occurring in that year, to try to find an explanation for why it is so starkly higher than the other values.
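
To put the three movies on a comparable scale, the change from the year before release to the year after can be expressed as a percentage. This is a small illustrative calculation using only the counts reported in the Data Processing section. The helper name `percent_change` is hypothetical.

```python
# Counts reported above: (year before release, year after release)
counts = {
    "Close Encounters of the Third Kind": (217, 255),
    "E.T. the Extra-Terrestrial": (132, 121),
    "Independence Day": (413, 976),
}

def percent_change(before, after):
    """Percentage change from the pre-release count to the post-release count."""
    return (after - before) / before * 100

changes = {title: percent_change(before, after)
           for title, (before, after) in counts.items()}

for title, change in changes.items():
    print(f"{title}: {change:+.1f}%")
```

The percentages describe the size of each change but say nothing about its cause.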

#### Limitations and Implications

This study was limited in that I only looked at three movies from different decades, so I had only one movie per time period and could not see trends within a period. Expanding to the top 20 or 50 alien movies for a decade could help me find trends. As for implications, I think this data is a good starting point; some interesting trends highlighted in the discussion could be expanded upon and may turn out to be statistically significant with more data. Additionally, I chose the three movies based on my personal preference and their general success as movies. I knew they were popular, high-grossing films, but they were not chosen with any sort of rigor. When choosing the next 20 or 50 movies, I would list all alien movies for a decade, rank them by popularity, and then randomly choose 20 to 50. This would reduce bias in my findings. Last, movies were not the only thing happening in these years; external world or national events may have had a greater impact on sighting frequency.

### Conclusion

Overall, there was no statistically significant relationship between the release of major alien movies and UFO sighting frequency, but considering the small sample size there is plenty of room to grow. From this study, it looks like my hypothesis that "after major alien movie releases, UFO sightings increase in frequency compared with the time periods before the movie was released" was not supported statistically. It is important to note that the numbers did increase from the pre-release year for 2 of the 3 movies (Close Encounters and Independence Day), but from this test the conclusion cannot be drawn that this increase depends on the movie release. This study pushes my thinking and excites me for further study into what may actually cause this increase, especially in 1997, when sightings were more than double those of any other year examined. Last, since movie releases do not appear to be the reason for alien sightings, perhaps we could consider that aliens are real and the sighting frequency depends on something else entirely.

#### Resources
Simón, A. (1981). A quantitative, nonreactive study of mass behavior with emphasis on the cinema as behavioral catalyst. Psychological Reports, 48(3), 775-785.

Graham, R., & Alford, M. (2011). A history of government management of ufo perceptions through film and television. 49th Parallel, 25.