# Gun Violence Data 
# Capstone Project --- Yu Shang
## ALY6140

## Introduction

### This dataset was downloaded from https://github.com/jamesqo/gun-violence-data
### The repository contains data for all recorded gun violence incidents in the US between January 2013 and March 2018. It covers more than 260k incidents, with detailed information about each one, and is available in CSV format. It is intended for data scientists and statisticians who want to study gun violence and predict future trends.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Read the csv file
data = pd.read_csv('C:\\Users\\boris\\Desktop\\Capstone Project\\gun-violence-data(Yu Shang).csv')
# Consider which columns will be used in the project
data.columns

## Data Cleanup

### This dataset contains complex data types. Most columns are stored as objects that combine strings and separator symbols. For example, "participant_age" is stored as 0::25||1::31||2::33||3::34||4::33, which means there are five participants: the first is 25 years old, the second is 31, and so on. "participant_gender" uses the same layout: 0::Male||1::Male||2::Male||3::Male||4::Male
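
As a quick illustration of the layout described above, the sketch below parses the sample string into a list of ages. The helper name `parse_participant_field` is hypothetical and only used here; it assumes every entry follows the `index::value` pattern separated by `||`.

```python
def parse_participant_field(raw):
    """Split an 'index::value||index::value' string into a list of values."""
    values = []
    for item in str(raw).split("||"):
        parts = item.split("::")
        # Skip anything that is not an 'index::value' pair (e.g. a missing value)
        if len(parts) == 2:
            values.append(parts[1])
    return values

sample_age = "0::25||1::31||2::33||3::34||4::33"
ages = [int(v) for v in parse_participant_field(sample_age)]
print(ages)       # [25, 31, 33, 34, 33]
print(len(ages))  # 5
```

Indices 0 through 4 give five participants, which is why the count is one more than the last index.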

In [None]:
# Filtering the columns that I will need in this project
data[['date', 'state', 'city_or_county', 'n_killed', 'n_injured','gun_type', 'incident_characteristics', 'location_description',
      'n_guns_involved', 'participant_age', 'participant_age_group', 'participant_gender','participant_name', 'participant_relationship', 
      'participant_status',]].head()

In [None]:
# Convert the original column "participant_age" into a cleaned column "Age"
def convert_format(x):
    # Entries look like "0::25||1::31"; keep only the value after each "::"
    x = str(x)
    y = x.split("||")
    d = {}
    for item in y:
        ylist = item.split("::")
        if len(ylist) > 1:
            d[ylist[0]] = ylist[1]
    # Missing or malformed entries produce an empty list
    if not d:
        return []
    return list(d.values())
data['Age'] = data['participant_age'].apply(convert_format)
data[['participant_age','Age']].head(10)

In [None]:
# Convert the original column "date" into a cleaned column "Year" so incidents can be grouped by year
def convert_year(aa):
    bb = aa.split('-')
    return bb[0]   
data['Year'] = data['date'].apply(lambda aa:convert_year(aa))
data[['date','Year']].head(10)

In [None]:
# Convert the original column "date" into two more cleaned columns, "Month" and "Day", alongside "Year"
def convert_month(aa):
    bb = aa.split('-')
    return int(bb[1])   
data['Month'] = data['date'].apply(lambda aa:convert_month(aa))
def convert_days(aa):
    bb = aa.split('-')
    return int(bb[2])   
data['Day'] = data['date'].apply(lambda aa:convert_days(aa))
data[['date','Year','Month','Day']].head(10)
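
A possible alternative to splitting the date string by hand is to let pandas parse it. The sketch below assumes the `date` column is formatted as `YYYY-MM-DD`, as the split on `-` above implies, and uses a small made-up frame named `demo` in place of the real `data`. Note that this version yields an integer `Year`, whereas the cell above keeps it as a string.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real `data`
demo = pd.DataFrame({"date": ["2013-01-01", "2017-06-15", "2018-03-31"]})

# Parse once, then read the parts through the .dt accessor
parsed = pd.to_datetime(demo["date"], format="%Y-%m-%d")
demo["Year"] = parsed.dt.year
demo["Month"] = parsed.dt.month
demo["Day"] = parsed.dt.day
print(demo)
```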

In [None]:
# converting the original column "participant_gender" to a cleaned column "Gender".
data['Gender'] = data['participant_gender'].apply(lambda x:convert_format(x))
data[['Gender','participant_gender']].head(10)

## Analyzing

### After looking into the legal age for buying guns, I found that under federal law the minimum age to buy a handgun from a licensed dealer is 21. The age limit drops to 18 if the handgun is purchased from a private, unlicensed seller, who could be a neighbor, someone online, or a seller at a gun show.

In [None]:
# To make the cleaned column "Age" easier to use, flatten it into one list of all participants' ages
age_list = []
i = 0
while i < len(data.Age):
    age_list.extend(data.Age[i])
    i = i + 1
age_series = pd.Series(age_list)


age_int = age_series.astype(int)
age_to_list = age_int.values.tolist()
agelist = []
# Filtering the age list (14 < AGE < 61)
for a in age_to_list:
    if a > 14 and a < 61:
        agelist.append(a)
age_clean = pd.Series(agelist)
age = age_clean.value_counts().sort_index(axis = 0)
age
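
The same flattening and filtering can be sketched with `Series.explode`, which turns each list element into its own row. This is only an alternative under the assumption that every age is a numeric string; `demo` is a made-up frame, not the real `data`.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned "Age" column (one list per incident)
demo = pd.DataFrame({"Age": [["25", "31"], [], ["17"], ["70", "19"]]})

# explode() gives one row per age; empty lists become NaN and are dropped
ages = demo["Age"].explode().dropna().astype(int)

# Same filter as above: keep 14 < age < 61
ages = ages[(ages > 14) & (ages < 61)]
age_counts = ages.value_counts().sort_index()
print(age_counts)
```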

### This visualization shows the relationship between age and the number of participants in gun violence incidents. As the trend line shows, the number of participants decreases as age increases: the two have a strong negative correlation.

In [None]:
# Drawing a scatter plot
plt.scatter(age.index,age.values)
plt.xlabel('Age')
plt.ylabel('Number of Criminal Participants')
plt.title('Age vs. Number of Participants Involved in Gun Violence')

In [None]:
age_frame = pd.DataFrame({'Age': age.index, 'Number of Participants in Criminals': age.values})
sns.regplot(x = 'Age', y = 'Number of Participants in Criminals', data = age_frame, fit_reg = True)
plt.ylim(0, 12000)
plt.title('Age vs. Number of Participants Involved in Gun Violence')
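
To put a number on how strong the relationship is, one option is the Pearson correlation coefficient between age and the participant count. The sketch below uses made-up counts purely for illustration; in the notebook the real `age` Series would take the place of `toy_age`.

```python
import pandas as pd

# Made-up counts for illustration only, not values from the dataset
toy_age = pd.Series([900, 800, 650, 500, 300], index=[18, 25, 35, 45, 55])
toy_frame = pd.DataFrame({"Age": toy_age.index, "Count": toy_age.values})

# Series.corr uses the Pearson method by default
r = toy_frame["Age"].corr(toy_frame["Count"])
print(round(r, 3))
```

A value close to -1 would support the "strong negative correlation" reading, while a value near 0 would not.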

### This bar chart shows the number of gun violence participants at each age, for the 15 most common ages. Most participants are around 18, 19, and 20 years old. Based on this chart, I think controls should be tightened on teenagers' access to guns and on the people who sell guns to them.

In [None]:
age123 = age_series.value_counts().head(15).sort_index(axis = 0)
plt.bar(age123.index, age123.values, 0.8)
plt.ylabel('Number of Participants at that Age')
plt.xlabel('Participant Age')
plt.title('Gun Violence Participants by Age')

### This graph shows the total number of people injured and killed in each year. Because the data is only recorded through March 2018, the actual number of victims in 2018 is higher than shown. The number of people killed or injured by gunshots is not decreasing, although the 2017 figures did not change dramatically from 2016.

In [None]:
# Number of people killed in gun violence, grouped by year
n_kill = data.groupby('Year')['n_killed'].sum()
n_kill

In [None]:
# Number of people injured in gun violence, grouped by year
n_injury = data.groupby('Year')['n_injured'].sum()
n_injury

In [None]:
# Plot a stacked bar chart with years on the x-axis and the number of people on the y-axis
plt.bar(n_kill.index, n_kill.values, bottom = n_injury.values)
plt.bar(n_injury.index, n_injury.values)
plt.legend(['People got killed', 'People got injured'])
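
Because 2018 only covers January through March, one way to compare it fairly with earlier years is to restrict every year to the same months. The sketch below assumes the `Year` and `Month` columns built earlier and uses a made-up frame named `demo` rather than the real `data`.

```python
import pandas as pd

# Hypothetical rows for illustration only
demo = pd.DataFrame({
    "Year": ["2017", "2017", "2017", "2018", "2018"],
    "Month": [1, 2, 7, 1, 3],
    "n_killed": [1, 0, 2, 1, 1],
    "n_injured": [2, 1, 0, 0, 3],
})

# Keep only January-March so every year covers the same months
first_quarter = demo[demo["Month"] <= 3]
q1_totals = first_quarter.groupby("Year")[["n_killed", "n_injured"]].sum()
print(q1_totals)
```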

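### The purpose of these two bar charts is to check whether gun violence clusters in a specific month or on a specific day of the month, for example around religious holidays or festivals. From what I found, no particular day stands out, but the number of people killed is high at the beginning of the year, especially in January and March. Note that January through March are counted in one more year (2018) than the other months, which may account for part of this difference.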

In [None]:
kill_day = data.groupby('Day')['n_killed'].sum()
plt.bar(kill_day.index, kill_day.values)

In [None]:
kill_month = data.groupby('Month')['n_killed'].sum()
plt.bar(kill_month.index, kill_month.values)

### Participants by gender. As we can see, 88% of the participants are male and 12% are female.

In [None]:
#Calculate the participants by gender
gender_list = []
i = 0
while i < len(data.Gender):
    gender_list.extend(data.Gender[i])
    i = i + 1
gender_series = pd.Series(gender_list)
gender123 = gender_series.value_counts().head(2)
gender123

In [None]:
# Create a pie chart for better visualization
# value_counts() sorts in descending order, so position 0 is Male and position 1 is Female
male_count = gender123.iloc[0]
female_count = gender123.iloc[1]
plt.figure(figsize=(10, 6))
# plt.pie normalizes the counts itself, so no manual rounding is needed
plt.pie(x=[female_count, male_count], explode=[0, 0.1], autopct='%0.1f%%', labels=['Female', 'Male'])
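
The percentages can also be read directly from `value_counts(normalize=True)`, which avoids relying on the order of the counts. The sketch below uses a tiny made-up Series named `toy_gender`; in the notebook the real `gender_series` would be used instead.

```python
import pandas as pd

# Made-up values for illustration only
toy_gender = pd.Series(["Male", "Male", "Female", "Male"])

# normalize=True returns fractions that sum to 1; multiply by 100 for percent
shares = toy_gender.value_counts(normalize=True) * 100
print(shares)
```

Looking up `shares["Male"]` and `shares["Female"]` by label keeps the result correct even if the ordering changes.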

### This part shows the number of people killed in gun violence by state. California, Texas, and Florida have the most victims killed by gunshots.

In [None]:
nums1 = data.groupby('state')['n_killed'].sum()
nums11 = nums1.sort_values(ascending=False).head(10).sort_index(axis = 0)
nums1.sort_values(ascending=False)

In [None]:
#bar graph
plt.barh(nums11.index, nums11.values, 0.8)
plt.ylabel('State')
plt.xlabel('People got Killed')
plt.title('People Killed by Gunshots in the Top 10 States')

### This part shows the number of people injured in gun violence by state. Illinois, California, and Florida have the most victims injured by gunshots.

In [None]:
nums2 = data.groupby('state')['n_injured'].sum()
nums22 = nums2.sort_values(ascending=False).head(10).sort_index(axis = 0)
nums2.sort_values(ascending=False)

In [None]:
plt.barh(nums22.index, nums22.values, 0.8)
plt.ylabel('State')
plt.xlabel('People Injured')
plt.title('People Injured by Gunshots in the Top 10 States')

### After showing the number of people killed and injured in each state separately, I want to combine them to find out where gun violence happens the most, so that gun laws can be better targeted and regulated. The top five states are Illinois, California, Texas, Florida, and Ohio.

In [None]:
victim = pd.concat([nums1, nums2], axis = 1)
victim['total'] = victim['n_killed'] + victim['n_injured']
victim
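
The combined table can also be built in a single `groupby` over both columns instead of concatenating two Series. This is a minimal sketch on a made-up frame named `demo`, not the real `data`.

```python
import pandas as pd

# Hypothetical rows for illustration only
demo = pd.DataFrame({
    "state": ["Illinois", "Texas", "Illinois", "Ohio"],
    "n_killed": [1, 3, 0, 1],
    "n_injured": [3, 0, 2, 1],
})

# Sum both columns per state in one pass, then add the combined total
victim_demo = demo.groupby("state")[["n_killed", "n_injured"]].sum()
victim_demo["total"] = victim_demo["n_killed"] + victim_demo["n_injured"]
top = victim_demo.nlargest(2, "total")
print(top)
```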

In [None]:
vic = victim.nlargest(5,'total')
plt.bar(vic.index, vic['n_killed'], bottom = vic['n_injured'])
plt.bar(vic.index, vic['n_injured'])
plt.legend(['People got killed', 'People got injured'])
plt.xlabel('State')
plt.ylabel('Number of Victims')
plt.title('Top 5 States With Most Victims')

## Predictive Analytics

### In this part, I tested two hypotheses. The first is that there is a negative correlation between the age of the criminal participants and the number of victims in an incident. The second is that there is a positive correlation between the number of criminal participants and the number of victims in an incident.

In [None]:
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

def convert_format(x):
    x = str(x)
    y = x.split("||")
    d = {}
    for item in y:
        ylist = item.split("::")
        if len(ylist) > 1:
            d[ylist[0]] = ylist[1]
    if not d:
        return np.nan
    return list(d.values())
data['cleaned_Age'] = data['participant_age'].apply(lambda x:convert_format(x))
data[['participant_age','cleaned_Age']].head(10)

In [None]:
age_kill = data[['cleaned_Age','n_killed','n_injured']].dropna()
df1 = pd.DataFrame(age_kill)
df1.head()

In [None]:
def Average(x):
    # Mean age of the participants listed for one incident
    total = 0
    for i in x:
        total = total + int(i)
    return total / len(x)

def n_participant(y):
    return len(y)

df1['Age_mean'] = df1['cleaned_Age'].apply(lambda x: Average(x))
df1['number_of_participant'] =  df1['cleaned_Age'].apply(lambda y: n_participant(y))
df1['Victims'] = df1['n_killed'] + df1['n_injured']
df1[['Age_mean','n_killed','n_injured','Victims','number_of_participant']]

In [None]:
df2 = df1[(df1['Age_mean'] > 14) & (df1['Age_mean'] < 61)]
df2[['Age_mean','n_killed','n_injured','Victims']]

In [None]:
sns.regplot(x = "Age_mean", y = "Victims", data = df2)
plt.xlabel('Age of criminals')
plt.ylabel('Number of Victims')
plt.title('Age vs. Number of Victims')
plt.xlim(14,60)
plt.ylim(0,40)

In [None]:
from patsy import dmatrices
y,X = dmatrices('Victims ~ Age_mean', data=df2)
print("y:{}".format(y.shape))
print("X:{}".format(X.shape))

In [None]:
result1 = smf.ols('Victims ~ Age_mean', data = df2).fit()
print(result1.params)
result1.summary()

In [None]:
plt.scatter(df2.Age_mean, df2.Victims)
plt.plot(df2.Age_mean, result1.fittedvalues, 'r-')
plt.xlabel('Age_mean')
plt.ylabel('Number of Victims')
plt.title('Number of Victims vs. Mean Age of Criminal Participants')

### In conclusion, the regression shows no meaningful correlation between the number of victims and the age of the criminal participants. Here the dependent variable is the number of victims and the independent variable is the participants' mean age. I made this hypothesis because the earlier analysis showed that most participants are around 18-20, so I wanted to dig deeper into how this age group behaves.
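
A conclusion like this is easier to check when the slope and R-squared of the fitted line are stated explicitly. The sketch below computes both with NumPy on made-up values; it is not the notebook's statsmodels fit, only an illustration of what the two numbers mean.

```python
import numpy as np

# Made-up values for illustration only, not taken from the dataset
x = np.array([18.0, 22.0, 30.0, 41.0, 55.0])  # mean participant age
y = np.array([1.0, 2.0, 1.0, 2.0, 1.0])       # victims per incident

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

# R-squared: share of the variance in y explained by the line
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(slope, r_squared)
```

For the models fitted in this notebook, statsmodels exposes the same quantities as `result1.params`, `result1.rsquared`, and `result1.pvalues`; a slope near zero together with a very small R-squared is what would support a "no correlation" reading.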

In [None]:
sns.regplot(x = "number_of_participant", y = "Victims", data = df1)

In [None]:
result3 = smf.ols('Victims ~ number_of_participant', data = df1).fit()
print(result3.params)
result3.summary()

In [None]:
plt.scatter(df1.number_of_participant, df1.Victims)
plt.plot(df1.number_of_participant, result3.fittedvalues, 'r-')
plt.xlabel('Number of Criminal Participants')
plt.ylabel('Number of Victims')
plt.title('Trend on number of Victims based on the number of participant')

### In conclusion, there is a positive correlation between the number of criminal participants and the number of victims. This may indicate gang conflicts, which the US government should control better in order to reduce the number of victims. In this case, the dependent variable is the number of victims and the independent variable is the number of criminal participants. One point worth mentioning is that China regulates guns and other weapons strictly by law and makes it illegal to possess a gun, whereas in the United States people over 21 can buy guns with a gun license. From what I found, it is lawful to buy a gun with a license when over 21, but there is no detailed rule on the age for possessing a gun, which creates a market and opportunities for teenagers to get guns without permission and without knowing how to use them properly.