# Data analysis and visualization


* loading and transforming the data
* visualizations


## Loading the data

We will work with data from the 2019 StackOverflow survey.

https://insights.stackoverflow.com/survey/

The data is available on Google Drive, so we cannot easily download it from bash or the command line.
There is a Python module, `google_drive_downloader`, that downloads a document with a given id.

Let's try to install it.

Now we need to refresh the page to see the new kernel.

In [1]:
from google_drive_downloader import GoogleDriveDownloader as gdd

In [2]:
from pathlib import Path
path = str(Path.home()) + "/data/survey.zip"
path

'/home/jovyan/data/survey.zip'

In [3]:
gdd.download_file_from_google_drive(file_id='1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV',
                                    dest_path=path,
                                    unzip=True)

Downloading 1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV into /home/jovyan/data/survey.zip... Done.
Unzipping...Done.


## EXERCISE

Get familiar with the text files and their sizes. Upload the file to the HDFS directory /user/{USER}/survey/data

In [None]:
%%bash
ls
head -2 survey_results_public.csv
wc -l survey_results_public.csv
cat survey_results_schema.csv
hdfs dfs -mkdir -p /user/${USER}/survey/data/
hdfs dfs -put -f  survey_results_public.csv /user/${USER}/survey/data/
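
The same overview of the local files can be had from Python. This is a minimal sketch using only the standard library; the helper name `describe_files` and the `*.csv` pattern are illustrative assumptions, not part of the notebook. Note that the line count is not the record count when quoted CSV fields contain line breaks.

```python
from pathlib import Path


def describe_files(directory, pattern="*.csv"):
    """Return (name, size_in_bytes, line_count) for each matching file."""
    rows = []
    for file_path in sorted(Path(directory).glob(pattern)):
        # Read in binary mode so that odd encodings cannot break the count.
        with file_path.open("rb") as handle:
            line_count = sum(1 for _ in handle)
        rows.append((file_path.name, file_path.stat().st_size, line_count))
    return rows


# Example (assumes the archive was unzipped into ~/data, as in the download cell):
# for name, size, lines in describe_files(Path.home() / "data"):
#     print(f"{name}: {size} bytes, {lines} lines")
```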

## Create a table with the data using a Spark session
Use your own database, the one we created last time. Let's name the table survey.

In [None]:
import os
user_name = os.environ.get('USER')
print(user_name)

In [None]:
import random
port_number = random.randint(4000,4999)
print(port_number)

In [None]:
%%bash 
source my_env/bin/activate
pip install pyspark
deactivate

In [None]:
from pyspark.sql import SparkSession

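# 'yarn-client' as a master URL is deprecated; use 'yarn' and set the deploy mode explicitly
spark = SparkSession \
.builder \
.master('yarn') \
.config('spark.submit.deployMode', 'client') \
.config('spark.driver.memory','1g')\
.config('spark.executor.memory', '1g') \
.config('spark.ui.port', port_number) \
.appName(f'survey_{user_name}') \
.getOrCreate()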

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline  

## Saving the data

In [None]:
path = f'/user/{user_name}/survey/data/survey_results_public.csv'

In [None]:
db_name = user_name.replace('-','_')
table_name = "survey"

In [None]:
# spark.sql("show databases").show()
spark.sql(f'DROP DATABASE IF EXISTS {db_name} CASCADE')
spark.sql(f'CREATE DATABASE {db_name} LOCATION "/edugen/db/{db_name}"')
spark.sql(f'USE {db_name}')

In [None]:
spark.sql(f'DROP TABLE IF EXISTS {table_name}')

spark.sql(f'CREATE TABLE IF NOT EXISTS {table_name} \
          USING csv \
          OPTIONS (HEADER true, INFERSCHEMA true) \
          LOCATION "{path}"')

## Verifying the data
Let's check the data types.

In [None]:
spark.sql(f"describe {table_name}").show()
# show() prints only the first 20 rows, so not all columns are listed ...

In [None]:
spark.sql(f"describe {table_name}").show(100)
# incorrect data types: columns containing the text "NA" are inferred as string

In [None]:
spark.sql(f"select distinct Age from {table_name} order by Age desc").show()

## Handling 'NA' values

In [None]:
spark.sql(f'DROP TABLE IF EXISTS {table_name}')

spark.sql(f'CREATE TABLE IF NOT EXISTS {table_name} \
          USING csv \
          OPTIONS (HEADER true, INFERSCHEMA true, NULLVALUE "NA") \
          LOCATION "{path}"')
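
To see why the NULLVALUE option changes the inferred types, here is a deliberately simplified, plain-Python sketch of column type inference. It is not Spark's actual algorithm, and the function name `infer_column_type` and the sample values are made up for illustration. A single "NA" string forces the whole column to string unless it is declared as the null marker.

```python
def infer_column_type(values, null_token=None):
    """Guess 'int', 'double' or 'string' for a column of raw CSV strings.

    Values equal to null_token are treated as missing and skipped.
    """
    inferred = "int"
    for raw in values:
        if null_token is not None and raw == null_token:
            continue
        try:
            int(raw)
            continue  # still compatible with the current guess
        except ValueError:
            pass
        try:
            float(raw)
            inferred = "double"  # widen from int to double
        except ValueError:
            return "string"  # one non-numeric value decides the column
    return inferred


ages = ["23", "NA", "31.5", "40"]  # hypothetical sample column
print(infer_column_type(ages))                   # string
print(infer_column_type(ages, null_token="NA"))  # double
```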



In [None]:
spark.sql(f"describe {table_name}").show(100)

In [None]:
spark.sql(f"select * from {table_name} limit 10").toPandas()

In [None]:
spark.sql(f"select count(*) from {table_name}").show()

## Draw a histogram of the respondents' age

In [None]:
# Alias the cast so that the resulting pandas column is really called "Age"
ages = spark.sql(f"select cast(Age as int) as Age \
                    from {table_name} \
                    where Age is not null \
                    and Age between 10 and 80").toPandas()


In [None]:
ages.hist("Age", bins=10)
plt.show()

In [None]:
sns.distplot(ages, bins=10, rug=True, kde=False)

## How many hobbyist programmers are there? How does it look for women and for men?

In [None]:
hobby = spark.sql(f"select Hobbyist,count(*) as cnt from {table_name} group by Hobbyist").toPandas()
hobby_men = spark.sql(f"select Hobbyist,count(*) as cnt from {table_name} where Gender='Man' group by Hobbyist").toPandas()
hobby_women = spark.sql(f"select Hobbyist,count(*) as cnt from {table_name} where Gender='Woman' group by Hobbyist").toPandas()


In [None]:
hobby.plot.pie(y='cnt', labels=hobby['Hobbyist'], title="All")
plt.legend(loc="center left")
plt.show()

In [None]:
hobby_men.plot.pie(y='cnt', labels=hobby_men['Hobbyist'])
plt.legend(loc="center left")
plt.show()

In [None]:
hobby_women.plot.pie(y='cnt', labels=hobby_women['Hobbyist'])
plt.legend(loc="lower center")
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,5))
# axes[0]

# Take the labels from the same frame as the counts: the group order can differ between queries
hobby.plot.pie(y='cnt', labels=hobby['Hobbyist'], title="All", ax=axes[0], autopct='%.0f%%')
hobby_men.plot.pie(y='cnt', labels=hobby_men['Hobbyist'], title="Men", ax=axes[1], autopct='%.0f%%')
hobby_women.plot.pie(y='cnt', labels=hobby_women['Hobbyist'], title="Women", ax=axes[2], autopct='%.0f%%')

plt.show()

### Plot the relationship between age and the number of hours worked for professional developers aged 18-65.

In [None]:
age_work = spark.sql(f"Select Age, avg(WorkWeekHrs) as avg from {table_name} \
            where Age is not null and WorkWeekHrs is not null and Age between 18 and 65 and hobbyist = 'No' \
            group by Age \
            order by Age asc").toPandas()

In [None]:
age_work

In [None]:
age_work.plot(x='Age', y='avg', kind='line')

In [None]:
sns.relplot(x="Age", y="avg", kind="line", data=age_work);

In [None]:
spark.sql(f"select distinct DevType from {table_name} where DevType like '%Data scientist%' ").show(truncate=False)

## Show a bar chart of the number of respondents per country

In [None]:
max_countries = spark.sql(f"select country, count(*) as cnt \
                from {table_name} \
                group by Country \
                order by cnt DESC \
                limit 10 ").toPandas()

In [None]:
max_countries.plot.bar(y='cnt', x='country')
plt.show()

In [None]:
sns.catplot(x="country", y="cnt", kind="bar",\
            data=max_countries)\
.set_xticklabels(rotation=65)

## Show the average earnings in countries with more than 1000 respondents

In [None]:
country_salary = spark.sql(f"select country, \
    cast (avg(ConvertedComp) as int) as avg \
    from {table_name} \
    where country is not null \
    group by country \
    having count(*) > 1000 \
    order by avg desc \
    limit 50 ").toPandas()
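
The interplay of `group by`, `having count(*)` and `avg` in the query above can be sketched in plain Python. The function name and the sample rows below are hypothetical. The point is that the respondent count includes rows with a missing salary, while the average skips them, as SQL `avg()` skips NULLs. One simplification: a country whose salaries are all missing is dropped here, whereas SQL would return it with a NULL average.

```python
from collections import defaultdict


def average_by_country(rows, min_respondents):
    """rows: iterable of (country, salary or None) pairs.

    Returns {country: truncated average salary} for countries that have
    more than min_respondents rows.
    """
    respondents = defaultdict(int)
    salaries = defaultdict(list)
    for country, salary in rows:
        if country is None:
            continue
        respondents[country] += 1        # every row counts as a respondent
        if salary is not None:
            salaries[country].append(salary)  # only known salaries are averaged
    return {
        country: int(sum(values) / len(values))
        for country, values in salaries.items()
        if respondents[country] > min_respondents
    }


rows = [("PL", 100), ("PL", None), ("PL", 201), ("DE", 500)]
print(average_by_country(rows, min_respondents=2))  # {'PL': 150}
```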

In [None]:
country_salary.plot.barh(x="country", y="avg")
plt.show()

## Show the salary distribution in countries with more than 1000 respondents

In [None]:
# Alias the cast so that the pandas column keeps the name "ConvertedComp" used by the plots below
country_comp = spark.sql(f"select country, cast(ConvertedComp as int) as ConvertedComp \
                from {table_name} \
                where country IN (select country from {table_name} group by country having count(*) > 1000) \
                and ConvertedComp is not null and ConvertedComp > 0 \
                order by ConvertedComp desc").toPandas()

In [None]:
country_comp.boxplot(column="ConvertedComp", by="country", \
                     showfliers=False, rot=60, meanline=True)
plt.show()
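
With `showfliers=False` the box plot hides the points that fall outside the whiskers. Below is a small standard-library sketch of the usual rule (whiskers at 1.5 times the interquartile range, which is matplotlib's default). The helper `box_summary` and the sample numbers are illustrative assumptions. The "inclusive" method of `statistics.quantiles` interpolates linearly between data points, which should be close to what the plotting libraries compute.

```python
from statistics import quantiles


def box_summary(values, whisker=1.5):
    """Return the quartiles and the values a box plot would treat as outliers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


salaries = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up numbers
print(box_summary(salaries))
```

A single extreme salary such as 100 above stretches the axis of the whole chart, which is why the notebook hides outliers when comparing countries.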

In [None]:
sns.catplot(x="country", y="ConvertedComp", kind="box", \
            showfliers=False, data=country_comp, palette="Blues")\
    .set_xticklabels(rotation=65)

## EXERCISE:
Plot the salary distribution by gender.

In [None]:
df_all = spark.sql(f"select Gender, cast(ConvertedComp as int) as ConvertedComp \
                    from {table_name} where Gender is not null and ConvertedComp is not null \
                    and Gender in ('Man', 'Woman')").toPandas()

In [None]:
#df_all['Gender'] = df_all['Gender'].replace('Non-binary, genderqueer, or gender non-conforming','Non-binary', regex=True)

In [None]:
df_all.boxplot(by="Gender", column="ConvertedComp", showfliers=False, rot=60)

## Plot the popularity of programming languages

In [None]:
lang = spark.sql(f"select LanguageWorkedWith from {table_name}").toPandas()


In [None]:
languages = lang["LanguageWorkedWith"].str.split(';', expand=True)
languages

In [None]:
summary = languages.apply(pd.Series.value_counts)
summary

In [None]:
summary2 = pd.DataFrame({'count':summary.sum(axis=1)})
summary2.sort_values("count", inplace=True, ascending=True)
summary2
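
The split, `value_counts` and `sum` steps above amount to counting how often each language appears across the semicolon-separated answers. A plain-Python sketch of the same idea, with a hypothetical helper name and sample answers, may make the intent clearer:

```python
from collections import Counter


def count_languages(answers, separator=";"):
    """Count how many respondents mentioned each language."""
    counts = Counter()
    for answer in answers:
        if answer is None:  # the respondent skipped the question
            continue
        counts.update(
            part.strip() for part in answer.split(separator) if part.strip()
        )
    return counts


answers = ["Python;SQL", "Python;Java", None, "SQL"]  # made-up sample
print(count_languages(answers).most_common())
```

Each respondent contributes at most once per language, so the counts can be read as "number of respondents who use the language".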

In [None]:
from matplotlib.pyplot import figure

figure(num=None, figsize=(8, 9), dpi=80, facecolor='w', edgecolor='k')
plt.barh(width=summary2["count"], y=summary2.index)
plt.show()

In [None]:
summary2['index'] = summary2.index

sns.catplot(x="index", y="count", kind="bar", \
             data=summary2, palette="Blues") \
.set_xticklabels(rotation=65)

## Plot the popularity of languages among Data Scientists


In [None]:
def prepare_lang(df):
    languages = df["LanguageWorkedWith"].str.split(';', expand=True)
    summary = languages.apply(pd.Series.value_counts)
    summary2 = pd.DataFrame({'count':summary.sum(axis=1)})
    summary2.sort_values("count", inplace=True, ascending=True)
    return summary2

In [None]:
lang = spark.sql(f"select LanguageWorkedWith \
                from {table_name} \
                where DevType like '%Data scientist%'").toPandas()

sum_lang = prepare_lang(lang)


figure(num=None, figsize=(8, 9), dpi=80, facecolor='w', edgecolor='k')
plt.barh(width=sum_lang["count"], y=sum_lang.index)
plt.show()

In [None]:
sns.set(style="ticks", color_codes=True)
sns.distplot(spark.sql(f"select Age from {table_name} where Age is not null").toPandas())

# Education...

In [None]:
spark.sql(f"select distinct EdLevel from {table_name}").show(truncate=False)

In [None]:
ed_level = spark.sql(f"select EdLevel, WorkWeekHrs from {table_name} \
            where EdLevel is not null and YearsCodePro is not null \
            and cast(WorkWeekHrs as int) is not null \
            and cast(WorkWeekHrs as int) between 10 and 80 \
            and (EdLevel like '%Bachelor%' or EdLevel like '%Master%' or EdLevel like '%Other doctoral%')")

ed_pandas = ed_level.toPandas()
ed_pandas['EdLevel'] = ed_pandas['EdLevel'].replace('Bachelor’s degree (BA, BS, B.Eng., etc.)','Bachelor')
ed_pandas['EdLevel'] = ed_pandas['EdLevel'].replace('Master’s degree (MA, MS, M.Eng., MBA, etc.)','Master')
ed_pandas['EdLevel'] = ed_pandas['EdLevel'].replace('Other doctoral degree (Ph.D, Ed.D., etc.)','Doctor')

ed_pandas

In [None]:
sns.catplot(x="EdLevel", y="WorkWeekHrs", data=ed_pandas)


## Draw a boxplot showing the income distribution by education level

In [None]:

ed_level = spark.sql(f"select EdLevel, cast (CompTotal as int) as CompTotal from {table_name} \
            where cast(CompTotal as int) between 0 and 1000000  \
            and (EdLevel like '%Bachelor%' or EdLevel like '%Master%' or EdLevel like '%Other doctoral%')")

ed_pay = ed_level.toPandas()
ed_pay['EdLevel'] = ed_pay['EdLevel'].replace('Bachelor’s degree (BA, BS, B.Eng., etc.)','Bachelor')
ed_pay['EdLevel'] = ed_pay['EdLevel'].replace('Master’s degree (MA, MS, M.Eng., MBA, etc.)','Master')
ed_pay['EdLevel'] = ed_pay['EdLevel'].replace('Other doctoral degree (Ph.D, Ed.D., etc.)','Doctor')

# Check the largest compensation left after filtering
ed_pay['CompTotal'].max()

In [None]:
sns.catplot(x="EdLevel", y="CompTotal", kind="boxen",
            data=ed_pay);

## Draw a violin plot showing the income distribution by education level

In [None]:

ed_level = spark.sql(f"select EdLevel, Gender, cast (CompTotal as int) as CompTotal from {table_name} \
            where cast(CompTotal as int) between 1000 and 500000  \
            and gender in ('Man', 'Woman') \
            and (EdLevel like '%Bachelor%' or EdLevel like '%Master%' or EdLevel like '%Other doctoral%')")

ed_pay = ed_level.toPandas()
ed_pay['EdLevel'] = ed_pay['EdLevel'].replace('Bachelor’s degree (BA, BS, B.Eng., etc.)','Bachelor')
ed_pay['EdLevel'] = ed_pay['EdLevel'].replace('Master’s degree (MA, MS, M.Eng., MBA, etc.)','Master')
ed_pay['EdLevel'] = ed_pay['EdLevel'].replace('Other doctoral degree (Ph.D, Ed.D., etc.)','Doctor')

ed_pay



## Draw a plot showing the income distribution by education level and gender

In [None]:
df = spark.sql(f"select Age, DevType, cast (ConvertedComp as int) as ConvertedComp \
            from {table_name} \
            where cast(ConvertedComp as int) between 0 and 300000 \
            and DevType is not null").toPandas()

df['DS'] = df['DevType'].apply(lambda x: 1 if 'Data scientist' in x else 0)
df['AR']=df['DevType'].apply(lambda x: 1 if 'Academic researcher' in x else 0)
df['MGR']=df['DevType'].apply(lambda x: 1 if 'anager' in x else 0)



In [None]:
sns.catplot(x="EdLevel", y="CompTotal", hue="Gender", kind="violin", split=True, data=ed_pay)

In [None]:
sns.scatterplot(x="ConvertedComp", y="Age", hue='AR', data=df ,alpha=0.6)

In [None]:
sns.scatterplot(x="ConvertedComp", y="Age", hue='MGR', data=df ,alpha=0.6)

In [None]:
spark.sql("select distinct DevType from survey").show(truncate=False)

In [None]:
so_v = spark.sql("select SOVisitFreq, country, count (*) as cnt from survey \
            where country is not null and SOVisitFreq is not null \
            and country in ('Poland', 'United States', 'Russian Federation', 'China', 'India') \
            group by country, SOVisitFreq").toPandas()
so_v

In [None]:
heatmap2_data = pd.pivot_table(so_v, values='cnt', index=['country'], columns='SOVisitFreq')
sns.heatmap(heatmap2_data, cmap="BuGn")

## Draw a heatmap of StackOverflow visit frequency for selected countries

In [None]:
spark.sql("select distinct SOVisitFreq from survey").show()

In [None]:
so_v = spark.sql("select SOVisitFreq, t1.country, count(*)/first(t2.t) as cnt from survey t1 \
            join (select country, count(*) as t from survey group by country) t2 \
            on t1.country = t2.country \
            where t1.country is not null and SOVisitFreq is not null \
            and t1.country in ('Poland', 'United States', 'Russian Federation', 'China', 'India', 'Germany', 'Japan') \
            group by t1.country, SOVisitFreq").toPandas()

# Order the categories from least to most frequent so the heatmap columns read left to right
visit_order = ["I have never visited Stack Overflow (before today)",
               "Less than once per month or monthly",
               "A few times per month or weekly",
               "A few times per week",
               "Daily or almost daily",
               "Multiple times per day"]
so_v['SOVisitFreq'] = pd.Categorical(so_v['SOVisitFreq'], categories=visit_order, ordered=True)
# so_v.sort_values('SOVisitFreq')
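
The query above divides each (country, frequency) count by the total number of rows for that country, so countries with very different numbers of respondents become comparable. A plain-Python sketch of that normalization follows; the function name and sample rows are hypothetical. Because the denominator also counts respondents who did not answer the frequency question, the shares of one country may add up to slightly less than 1.

```python
from collections import Counter


def visit_shares(rows):
    """rows: iterable of (country, visit_frequency or None) pairs.

    Returns {(country, frequency): share}, where the share is the count
    for that pair divided by all rows for the country.
    """
    rows = list(rows)
    # Denominator: every row of the country, answered or not.
    totals = Counter(country for country, _ in rows if country is not None)
    # Numerator: rows where the frequency question was answered.
    pairs = Counter(
        (country, freq) for country, freq in rows
        if country is not None and freq is not None
    )
    return {key: count / totals[key[0]] for key, count in pairs.items()}


rows = [("Poland", "Daily"), ("Poland", "Daily"), ("Poland", "Weekly"),
        ("Poland", None), ("India", "Daily")]  # made-up sample
print(visit_shares(rows))
```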


In [None]:
heatmap2_data = pd.pivot_table(so_v, values='cnt', index=['country'], columns='SOVisitFreq')
sns.heatmap(heatmap2_data, cmap="BuGn")