## Visualization exercise
```In this exercise you will practice simple data science tasks. Mainly, you will learn a few common, useful, and slightly advanced visualization methods. You will work with the Chicago Crime data, which records crimes reported in Chicago, and you will be asked to plot some of its features.```

```~Ittai Haran```

In [None]:
import seaborn as sns
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

```Start by loading the data.
Link to the data: https://drive.google.com/open?id=1oy7hnl3u8IYt7U69kOqATwbooEXSUzrm```

In [None]:
# Raw string, so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'D:\Download\Crimes_-_2001_to_present.csv')

```What are the types of the columns? Which columns contain numbers? Which contain nans?
Among the categorical columns that contain nans, how many distinct values are there?```

In [None]:
print('types:')
# null_counts was removed in pandas 2.0; show_counts is its replacement
df.info(show_counts=True)

In [None]:
print('columns which contain numbers:')
# select_dtypes('number') covers every int/float width, not only int64 and float64
df.select_dtypes(include='number').columns

In [None]:
print('number of nans in each column:')
df.isnull().sum()

In [None]:
print('only columns which contain nans')
nan_cols = [i for i in df.columns if df[i].isnull().any()]
nan_cols

In [None]:
print('only categorical columns which contain nans')
nan_cols_categories = [i for i in df.columns if df[i].isnull().any() and df[i].dtype=='object']
print(nan_cols_categories)
print('distinct values for these features')
df[nan_cols_categories].nunique()
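
The questions above can also be answered with `select_dtypes`. The sketch below runs on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the Chicago data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric, one text and one boolean column
toy = pd.DataFrame({
    'Ward': [1.0, np.nan, 3.0],
    'Location Description': ['STREET', None, 'STREET'],
    'Arrest': [True, False, True],
})

# Numeric columns: every int/float dtype, booleans excluded
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# Non-numeric, non-boolean columns that contain at least one missing value
categorical = toy.select_dtypes(exclude=['number', 'bool'])
nan_categorical = [c for c in categorical.columns if categorical[c].isnull().any()]

# Distinct non-missing values in each of those columns
distinct_counts = toy[nan_categorical].nunique()
print(numeric_cols, nan_categorical, distinct_counts.to_dict())
```

Excluding by dtype family rather than comparing to `'object'` is a design choice here: it should keep working if text columns are stored with a dedicated string dtype.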

```Plot the distribution of each of the numeric features (histogram: plt.hist or pd.DataFrame.hist).
Also, if there are columns that have missing values but fewer than 200 distinct values, plot their histograms (for example with sns.countplot).```

In [None]:
sns.set(rc={'figure.figsize':(10,3)})
num_features = df.select_dtypes(include='number').columns  # all int/float columns
for feat in num_features:
    df[feat].hist()
    plt.title(feat)
    plt.show()

In [None]:
sns.set(rc={'figure.figsize':(10,25)})
for feat in nan_cols_categories:
    if df[feat].nunique() < 200:
        sns.countplot(y=feat, data=df, order=df[feat].value_counts().index)
        plt.show()

```Now plot the number of crimes in Chicago per month, and the number of arrests per month. Use df.plot.```

In [None]:
# extract the month (first two characters) from the date string
df['Month'] = df['Date'].apply(lambda x: x[:2])

In [None]:
sns.set(rc={'figure.figsize':(20,5)})
ax = sns.countplot(x='Month', data=df)
plt.title('number of crimes in Chicago per month')
# label each bar with its count
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()*1.005))
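
The cell above counts incidents by calendar month with all years pooled, and it does not show arrests. If a month-by-month time series of both is wanted, one possible sketch with `resample` and `df.plot` follows. The toy rows, the names `toy` and `monthly`, and the date format string are assumptions; the format is inferred from the month sitting in the first two characters of `Date`.

```python
import pandas as pd

# Toy incidents (hypothetical); the real 'Date' column is assumed to look like this
toy = pd.DataFrame({
    'Date': ['01/05/2015 01:30:00 PM', '01/20/2015 09:00:00 AM',
             '02/03/2015 11:15:00 PM', '02/14/2015 06:45:00 AM',
             '02/28/2015 12:00:00 PM', '03/10/2015 03:30:00 PM'],
    'Arrest': [True, False, False, True, True, False],
})

# An explicit format is usually much faster than letting pandas infer it
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Group rows into month-start bins; size() counts incidents, sum() counts True arrests
by_month = toy.set_index('Date').resample('MS')
monthly = pd.DataFrame({
    'crimes': by_month.size(),
    'arrests': by_month['Arrest'].sum(),
})

# One line per column, drawn with the pandas plotting API
ax = monthly.plot(title='Crimes and arrests per month (toy data)')
```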

```Do the same for weeks rather than months. Use df.resample.```

In [None]:
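# Keep only the date part (first 10 characters); converting the full timestamp took too long
df['LightDate'] = df['Date'].apply(lambda x: x[0:10])
df['LightDate'] = pd.to_datetime(df['LightDate'])
# Series.dt.week was removed in pandas 2.0; isocalendar().week returns the ISO week number
df['Week'] = df['LightDate'].dt.isocalendar().week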

sns.set(rc={'figure.figsize':(50,5)})
ax = sns.countplot(x='Week', data=df)
plt.title('number of crimes in Chicago per week')
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()*1.005))
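
The exercise text asks for `df.resample`, which bins by actual calendar weeks instead of pooling week numbers across years. A minimal sketch on made-up dates (the names `toy` and `weekly` are hypothetical):

```python
import pandas as pd

# Toy incident dates (hypothetical); 2015-01-01 was a Thursday
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01', '2015-01-03', '2015-01-05',
                            '2015-01-06', '2015-01-15']),
})

# 'W' bins end on Sunday and are labelled by that Sunday
weekly = toy.set_index('Date').resample('W').size()
print(weekly)
```

On the real data, `weekly.plot()` would then draw the weekly counts as a single line.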

In [None]:
df['Arrest']

```Let's look at the distribution of 'Ward' for Arrest=True and for Arrest=False. Use sns.violinplot.```

In [None]:
sns.set(rc={'figure.figsize':(10,10)})
sns.violinplot(data=df, y='Ward', x='Arrest', palette="muted")
plt.show()  # show after drawing, not before

```Plot in the same graph, for each Primary Type, the number of crimes of this type, for each month. Do it using df.pivot_table.```

In [None]:
df['combine_date'] = (df['Year'].astype(str)) + (df['Month'].astype(str))
#df['combine_date'] = df['combine_date'].astype('int32')

In [None]:
pivot_table = df.pivot_table(index='Primary Type' , columns='combine_date' , values='ID' ,aggfunc='count' )

In [None]:
sns.set(rc={'figure.figsize':(30,10)})
for prim in pivot_table.index:
    sns.lineplot(data=pivot_table.loc[prim], label=prim)
# format the shared axes once, after all lines are drawn
plt.xticks(rotation=90)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
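
An alternative layout, sketched here on made-up rows, puts the months on the index and the crime types on the columns, so that a single `DataFrame.plot` call draws one line per type. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy incidents (hypothetical values) using the same column names as the data
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7],
    'Primary Type': ['THEFT', 'THEFT', 'BATTERY', 'THEFT',
                     'BATTERY', 'BATTERY', 'ROBBERY'],
    'combine_date': ['201501', '201501', '201501', '201502',
                     '201502', '201502', '201502'],
})

# Rows are months, columns are crime types; fill_value=0 avoids gaps
# for types with no incidents in a given month
counts = toy.pivot_table(index='combine_date', columns='Primary Type',
                         values='ID', aggfunc='count', fill_value=0)

# Each column becomes one line in the same graph
ax = counts.plot(title='Crimes per month by primary type (toy data)')
```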

```Let's focus on the features 'X Coordinate', 'Y Coordinate'. What is the relation between them? Use sns.pairplot and sns.jointplot to answer this question. You might want to get rid of the missing values before you act. Can you find another problem preventing you from understanding the relations between the features? What is it and how can you get rid of it?```

In [None]:
sns.pairplot(data=df[['X Coordinate', 'Y Coordinate']])

In [None]:
sns.jointplot(data=df[['X Coordinate', 'Y Coordinate']] , x='X Coordinate', y ='Y Coordinate')

In [None]:
df[['X Coordinate', 'Y Coordinate']].min()

In [None]:
# Mask zero coordinates as NaN, then drop every row with a missing value
coords = df[['X Coordinate', 'Y Coordinate']]
helper_df_cordinate = coords[coords != 0].dropna()

In [None]:
sns.pairplot(data=helper_df_cordinate[['X Coordinate', 'Y Coordinate']])

In [None]:
sns.jointplot(data=helper_df_cordinate, x='X Coordinate', y ='Y Coordinate', kind="hex")
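
The filtering step can be checked in isolation. The sketch below, on made-up coordinates, shows that zeros and missing values are both removed; `coords_toy` and `clean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy coordinates (hypothetical): one zero placeholder row and one missing value
coords_toy = pd.DataFrame({
    'X Coordinate': [1150000.0, 0.0, np.nan, 1170000.0],
    'Y Coordinate': [1890000.0, 0.0, 1900000.0, 1910000.0],
})

# Zeros act as a placeholder far from the real points, so treat them as missing
clean = coords_toy.replace(0, np.nan).dropna()
print(clean)
```

Without this step a few points at the origin stretch both axes, and the real points collapse into one corner of the plot.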

```Split the map into 25 districts, and plot the number of crime incidents in each one of them, per month. First rotate the map so it will be more 'square-like' (Hint: by a linear transformation).```

In [None]:
sns.scatterplot(data=df[df['X Coordinate']!=0], y='X Coordinate', x='Y Coordinate' , hue='District')

In [None]:
pivot_table = df.pivot_table(index='District' , columns='combine_date' , values='ID' ,aggfunc='count' )
sns.set(rc={'figure.figsize':(20,15)})
for district in pivot_table.index:
    sns.lineplot(data=pivot_table.loc[district], label=district)
# format the shared axes once, after all lines are drawn
plt.xticks(rotation=90)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
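
The cells above reuse the existing `District` column. A more literal reading of the task (rotate the map, then cut it into a 5x5 grid) might look like the sketch below. The random points, the 30-degree angle, and the names `x_rot`, `y_rot` and `cell` are all hypothetical; a suitable angle for the real map would have to be chosen by eye or fitted.

```python
import numpy as np
import pandas as pd

# Toy points (hypothetical) standing in for the cleaned coordinates
rng = np.random.default_rng(0)
pts = pd.DataFrame({
    'X Coordinate': rng.uniform(0, 10, 200),
    'Y Coordinate': rng.uniform(0, 20, 200),
})

# Rotation is a linear transformation: multiply each (x, y) by a 2x2 matrix
theta = np.deg2rad(30)  # assumed angle, for illustration only
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
xy = pts[['X Coordinate', 'Y Coordinate']].to_numpy() @ rot.T
pts['x_rot'], pts['y_rot'] = xy[:, 0], xy[:, 1]

# Cut each rotated axis into 5 equal-width bins, giving 5 x 5 = 25 cells
col = pd.cut(pts['x_rot'], bins=5, labels=False)
row = pd.cut(pts['y_rot'], bins=5, labels=False)
pts['cell'] = row * 5 + col
print(pts['cell'].value_counts().sort_index())
```

With a month column added, `cell` could take the place of `'District'` as the `index` argument of `pivot_table`.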

```Bonus: create a word cloud from the words you can find in the 'Description' field.```

In [None]:
#!pip install wordcloud

In [None]:
from wordcloud import WordCloud, STOPWORDS

In [None]:
# Join all descriptions with spaces so neighbouring words do not fuse; skip missing values
Description = ' '.join(df['Description'].dropna().astype(str))

In [None]:
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=1200,
                      height=1000
                      ).generate(Description)

In [None]:
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
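
To see which words will dominate the cloud, the word frequencies can be computed directly with pandas. This is a sketch on made-up descriptions; `desc` and `freqs` are hypothetical names.

```python
import pandas as pd

# Toy descriptions (hypothetical), including one missing value
desc = pd.Series(['SIMPLE', 'OVER $500', 'SIMPLE', None, 'TO VEHICLE'])

# Split each description on whitespace, one word per row, then count
words = desc.dropna().str.split().explode()
freqs = words.value_counts().to_dict()
print(freqs)
```

A dictionary of this shape can also be passed to `WordCloud.generate_from_frequencies` in place of the joined text; unlike `generate`, that route does not apply the stopword list.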