In [1]:
# import the pandas library
import pandas as pd

In [2]:
# read the data from CSV and set the index column
df = pd.read_csv('../data/survey_results_public.csv', index_col='Respondent')
schema_df = pd.read_csv('../data/survey_results_schema.csv', index_col='Column')

In [5]:
# set the maximum number of displayed rows and columns to 85
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

In this part of the tutorial we look at ways to group and aggregate the data in a data frame. These two operations are the first thing most people think of when they hear that we are analyzing data. This is the part of the tutorial where we actually pull statistics out of the data frame instead of only modifying the data it holds. For example, we may want to know the typical salary for a certain kind of developer, or how many people in each country know Python as a programming language. In this part we will be able to answer all of these questions.

Before moving on to the more complicated material we start with something simple and build up from there. First, aggregation. Aggregating means combining multiple values into a single result. One question we can answer is what the typical salary of a developer in this survey is. To do that we can take the median of the column that holds the salary data. Let's first look at the data in that column:

In [6]:
df['ConvertedComp']

Respondent
1            NaN
2            NaN
3         8820.0
4        61000.0
5            NaN
          ...   
88377        NaN
88601        NaN
88802        NaN
88816        NaN
88863        NaN
Name: ConvertedComp, Length: 88883, dtype: float64

Values shown as `NaN` mean that the respondent did not answer that question. To see the median of this column we can call the `median()` method on the 'ConvertedComp' column:

In [7]:
df['ConvertedComp'].median()

57287.0
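How `median()` treats the missing answers can be sketched on a toy Series. The values below are hypothetical and are not taken from the survey; the point is only that `NaN` entries are skipped by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy salaries; NaN marks a respondent who skipped the question.
salaries = pd.Series(
    [np.nan, 8820.0, 61000.0, np.nan, 95000.0], name="ConvertedComp"
)

# median() skips NaN by default (skipna=True), so only the three
# real values take part in the calculation.
median_salary = salaries.median()
print(median_salary)  # 61000.0

# With skipna=False a single missing value makes the result NaN.
strict_median = salaries.median(skipna=False)
print(strict_median)
```

This is why the survey median is computed only over the people who reported a salary.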

The overall median is fairly useful, but with so many respondents from different countries the salaries vary a lot: they depend on what is paid in each country, on the local cost of living, and on many other factors. A more useful statistic would be the median salary of the respondents per country (we look at that in the part on grouping). For now we go over a few more basic concepts. We can also apply aggregation methods to the entire data frame, not only to a Series object. We start by calling `median()` on the whole data frame:

In [8]:
df.median()

CompTotal        62000.0
ConvertedComp    57287.0
WorkWeekHrs         40.0
CodeRevHrs           4.0
Age                 29.0
dtype: float64

When `median()` is called on the whole data frame, it looks for all the columns that hold numeric data and computes the median of each column separately. Calling it without the `numeric_only=True` argument is deprecated and results in an error in newer versions: from pandas 2.0 on, `numeric_only` defaults to `False`, so the call raises a `TypeError` when the data frame has non-numeric columns

In [10]:
df.median(numeric_only=True)

CompTotal        62000.0
ConvertedComp    57287.0
WorkWeekHrs         40.0
CodeRevHrs           4.0
Age                 29.0
dtype: float64
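A minimal sketch of what `numeric_only=True` does, using a small hypothetical frame (the rows are made up for illustration): the text column is left out and each numeric column gets its own median.

```python
import pandas as pd

# Hypothetical mixed frame: one text column and two numeric columns.
toy = pd.DataFrame(
    {
        "Country": ["India", "Germany", "Canada"],
        "ConvertedComp": [10000.0, 60000.0, 70000.0],
        "Age": [25.0, 31.0, 40.0],
    }
)

# Only the numeric columns are aggregated; "Country" is skipped.
medians = toy.median(numeric_only=True)
print(medians)
```

The result is a Series indexed by the numeric column names, the same shape as the survey output above.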

If we want a broad statistical overview of the data frame we can use the `describe()` method. By default this method is again applied only to the columns that hold numeric values:

In [11]:
df.describe()

              CompTotal  ConvertedComp  WorkWeekHrs  CodeRevHrs        Age
count           55945.0        55823.0      64503.0     49790.0    79210.0
mean     551901400000.0       127110.7    42.127197    5.084308  30.336699
std    73319260000000.0       284152.3     37.28761    5.513931    9.17839
min                 0.0            0.0          1.0         0.0        1.0
25%             20000.0        25777.5         40.0         2.0       24.0
50%             62000.0        57287.0         40.0         4.0       29.0
75%            120000.0       100000.0        44.75         6.0       35.0
max               1e+16      2000000.0       4850.0        99.0       99.0


The `describe()` method shows several statistics for the columns that hold numbers, such as the minimum, the maximum, the mean, and others. The difference between `mean()` and `median()` is that `mean()` is strongly influenced by outliers, that is, by extreme values that are not representative. We know that in some cases the salary was entered as 2,000,000 dollars per year, and such values make `mean()` return 1.271107e+05 for the salaries, roughly 127,000, whereas `median()` returned about 57,000, a value much closer to what a typical respondent earns.
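The effect of a single outlier can be sketched with a few hypothetical salaries (made-up numbers, not survey data): adding one very large value moves the mean a long way while the median barely changes.

```python
import pandas as pd

# Hypothetical salaries without and with one extreme value.
typical = pd.Series([40000, 50000, 60000, 70000])
with_outlier = pd.Series([40000, 50000, 60000, 70000, 2000000])

print(typical.mean(), typical.median())            # 55000.0 55000.0
print(with_outlier.mean(), with_outlier.median())  # 444000.0 60000.0
```

One extra row multiplies the mean by about eight, while the median shifts only from 55,000 to 60,000.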

The 'count' row produced by `describe()` gives the number of people who answered each question. Those who did not answer are recorded as `NaN`, so the count is the total number of rows in the data frame minus the number of rows that hold `NaN` in that column. Let's call the `count()` method on its own for the 'ConvertedComp' column:

In [12]:
df['ConvertedComp'].count()

55823

Since the data frame holds almost 89,000 responses, the result above tells us that only about 56,000 people answered this question.
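The relationship between `count()`, the number of rows, and the number of missing values can be checked on a small hypothetical Series (toy values, not survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical column where two of the four respondents did not answer.
comp = pd.Series([np.nan, 8820.0, 61000.0, np.nan], name="ConvertedComp")

answered = comp.count()      # non-NaN values only
total_rows = len(comp)       # every row, answered or not
missing = comp.isna().sum()  # rows that hold NaN

print(answered, total_rows, missing)  # 2 4 2
```

So `count()` always equals `len()` minus the number of `NaN` entries in the column.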

Now let's go back to the 'Hobbyist' column. In that column the respondents answered 'Yes' or 'No' to whether they write code as a hobby. To see how many people answered 'Yes' and how many answered 'No', we can use the `value_counts()` method:

In [13]:
df['Hobbyist'].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64

The survey has a column that records which social media site each respondent uses the most. The column is named 'SocialMedia'. To see how popular these sites are we can use the `value_counts()` method. First, though, let's see what the data in that column looks like:

In [14]:
df['SocialMedia']

Respondent
1          Twitter
2        Instagram
3           Reddit
4           Reddit
5         Facebook
           ...    
88377      YouTube
88601          NaN
88802          NaN
88816          NaN
88863     WhatsApp
Name: SocialMedia, Length: 88883, dtype: object

In [15]:
df['SocialMedia'].value_counts()

Reddit                      14374
YouTube                     13830
WhatsApp                    13347
Facebook                    13178
Twitter                     11398
Instagram                    6261
I don't use social media     5554
LinkedIn                     4501
WeChat 微信                     667
Snapchat                      628
VK ВКонта́кте                 603
Weibo 新浪微博                     56
Youku Tudou 优酷                 21
Hello                          19
Name: SocialMedia, dtype: int64

If we want to see this data as proportions rather than raw counts, we can pass the `normalize` argument to `value_counts()` and set it to `True`:

In [17]:
df['SocialMedia'].value_counts(normalize=True)

Reddit                      0.170233
YouTube                     0.163791
WhatsApp                    0.158071
Facebook                    0.156069
Twitter                     0.134988
Instagram                   0.074150
I don't use social media    0.065777
LinkedIn                    0.053306
WeChat 微信                   0.007899
Snapchat                    0.007437
VK ВКонта́кте               0.007141
Weibo 新浪微博                  0.000663
Youku Tudou 优酷              0.000249
Hello                       0.000225
Name: SocialMedia, dtype: float64
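How the normalized values relate to the raw counts can be sketched on a hypothetical toy column (made-up answers). Note that `value_counts()` leaves out `NaN` by default, so the proportions are relative to the people who answered, not to all rows.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; one respondent skipped the question.
social = pd.Series(
    ["Reddit", "YouTube", "Reddit", np.nan, "Twitter"], name="SocialMedia"
)

counts = social.value_counts()                # raw counts, NaN excluded
shares = social.value_counts(normalize=True)  # fractions that sum to 1

# The same fractions computed by hand: counts divided by the number of answers.
manual = counts / social.count()

print(shares["Reddit"])  # 0.5
```

Here 'Reddit' gets 0.5 because it accounts for 2 of the 4 answers, even though the Series has 5 rows.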

From the normalized output we learn that about 17% of the respondents who answered use Reddit most often, 16% YouTube, and so on. The output also shows some social media sites that seem to be specific to certain regions, such as China or Russia. The question is how to find the most popular social media site for each country. To answer it we need the concept of grouping. This concept can be a little confusing at first, so we start with the simplest case.

If we want statistics by country, we have to group the data on the 'Country' column. To group a data set we have the `groupby()` method. What does using this method mean? According to the Pandas documentation, a group-by operation involves some combination of splitting the object, applying a function, and combining the results. We go through each of these steps to understand how it works.

To start, let's look at the number of responses for each value in the 'Country' column to get an idea of the data we have:
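The three steps can be sketched by hand on a small hypothetical frame (toy rows, not survey data) and compared with what `groupby()` returns. This is only an illustration of the idea, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data: a country and a salary per respondent.
toy = pd.DataFrame(
    {
        "Country": ["India", "Germany", "India", "Germany", "Canada"],
        "ConvertedComp": [10000.0, 60000.0, 20000.0, 80000.0, 70000.0],
    }
)

# 1. Split: one sub-frame per distinct country.
pieces = {
    country: toy[toy["Country"] == country]
    for country in toy["Country"].unique()
}

# 2. Apply: compute the median salary inside each piece.
medians = {
    country: piece["ConvertedComp"].median()
    for country, piece in pieces.items()
}

# 3. Combine: collect the per-group results into a single Series.
manual = pd.Series(medians).sort_index()

# groupby() performs the same three steps in one call.
grouped = toy.groupby("Country")["ConvertedComp"].median()

print(grouped)
```

Both `manual` and `grouped` hold one median per country, indexed by the country name.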

In [18]:
df['Country'].value_counts()

United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: Country, Length: 179, dtype: int64

Now let's go through the grouping process, which consists of three stages: splitting, applying a function, and combining the results. We call `groupby()` on our data frame and pass it the name of the column (or columns, when we want to group by several columns) as a list of values.

In [19]:
df.groupby(['Country'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7faf5088bd30>

Running the code above gives back an object of type 'DataFrameGroupBy'. This object contains many groups, and to understand it better we will look at one individual group. To make the object easier to work with we store it in a variable:

In [20]:
groupby_country_df = df.groupby(['Country'])

Since we grouped the rows on the 'Country' column, we can retrieve a group from this object using any country name present in the data frame. To retrieve a group we use the `get_group()` method and pass it a country name as its argument.

In [21]:
groupby_country_df.get_group('United States')

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,100 to 499 employees,"Developer, full-stack",3,16,Less than 1 year,Very satisfied,Slightly satisfied,Very confident,No,Not sure,I am not interested in new job opportunities,Less than a year ago,"Write code by hand (e.g., on a whiteboard);Int...",No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,61000.0,Yearly,61000.0,80.0,There's no schedule or spec; I work on what se...,,Less than once per month / Never,Home,A little below average,No,,"No, but I think we should",Developers typically have the most influence o...,I have little or no influence,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,,,.NET,.NET,Eclipse;Vim;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Daily or almost daily,Find answers to specific questions;Pass the ti...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,10 to 19 employees,Data or business analyst;Database administrato...,17,11,8,Very satisfied,Very satisfied,,,,I am not interested in new job opportunities,3-4 years ago,Complete a take-home project;Interview with pe...,Yes,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,90000.0,Yearly,90000.0,40.0,There is a schedule and/or spec (made by me or...,"Meetings;Non-work commitments (parenting, scho...",All or almost all the time (I'm full-time remote),Home,A little above average,"Yes, because I see value in code review",5.0,"No, but I think we should",Developers and management have nearly equal in...,I have a great deal of influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;...,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Rust...,Couchbase;DynamoDB;Firebase;MySQL,Firebase;MySQL;Redis,Android;AWS;Docker;IBM Cloud or Watson;iOS;Lin...,Android;AWS;Docker;IBM Cloud or Watson;Linux;S...,Angular/Angular.js;ASP.NET;Express;jQuery;Vue.js,Express;Vue.js,Node.js;Xamarin,Node.js;TensorFlow,Vim;Visual Studio;Visual Studio Code;Xcode,Windows,Development;Testing;Production,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,Yes,Yes,Twitter,In real life (in person),Username,2011,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
22,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,Some college/university study without earning ...,,Taken an online course in programming or softw...,"10,000 or more employees","Data or business analyst;Designer;Developer, b...",35,12,18,Slightly satisfied,Very dissatisfied,Somewhat confident,No,No,"I’m not actively looking, but I am open to new...",More than 4 years ago,Interview with people in senior / management r...,No,Industry that I'd be working in;Financial perf...,I had a negative experience or interaction at ...,USD,United States dollar,103000.0,Yearly,103000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Average,No,,"No, but I think we should","The CTO, CIO, or other management purchase new...",I have little or no influence,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Elasticsearch;MySQL;Oracle;Redis,Elasticsearch;MySQL;Oracle;Redis,Docker;Linux;Raspberry Pi;Windows,Docker;Linux;Raspberry Pi;Windows,Angular/Angular.js;Ruby on Rails,Angular/Angular.js;Ruby on Rails,Node.js,Node.js,Sublime Text;Visual Studio;Visual Studio Code,Windows,"Outside of work, for personal projects",Not at all,,Yes,Yes,Yes,Instagram,Online,Username,I don't remember,Daily or almost daily,Find answers to specific questions,3-5 times per week,Stack Overflow was much faster,0-10 minutes,Yes,A few times per week,Yes,"No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,47.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
23,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",Taken an online course in programming or softw...,"10,000 or more employees","Developer, full-stack",3,19,1,Slightly satisfied,Slightly satisfied,Very confident,No,Not sure,"I’m not actively looking, but I am open to new...",Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,Opportunities for professional development;How...,I was preparing for a job search,USD,United States dollar,69000.0,Yearly,69000.0,40.0,There is a schedule and/or spec (made by me or...,Distracting work environment;Meetings;Non-work...,A few days each month,Office,Average,"Yes, because I see value in code review",8.0,"Yes, it's part of our process",Developers and management have nearly equal in...,I have little or no influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Pyth...,Bash/Shell/PowerShell;Go;HTML/CSS;Java;JavaScr...,Oracle;SQLite,Couchbase;DynamoDB;Elasticsearch;Firebase;Oracle,Docker;Google Cloud Platform,Docker;iOS;Slack,React.js;Ruby on Rails,Express;React.js;Ruby on Rails;Vue.js,,React Native;TensorFlow,Visual Studio Code,MacOS,Development;Testing;Production,,Useful for immutable record keeping outside of...,Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Multiple times per day,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,I have never participated in Q&A on Stack Over...,Yes,"No, I've heard of them, but I am not part of a...","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,22.0,Man,No,Straight / Heterosexual,Black or of African descent,No,Appropriate in length,Easy
26,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...","10,000 or more employees","Designer;Developer, back-end;Developer, deskto...",12,8,8,Very satisfied,Very satisfied,,,,"I’m not actively looking, but I am open to new...",Less than a year ago,Interview with people in peer roles;Interview ...,No,Remote work options;Diversity of the company o...,I was preparing for a job search,USD,United States dollar,114000.0,Yearly,114000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Far above average,"Yes, because I see value in code review",2.0,"Yes, it's not part of our process but the deve...",Developers typically have the most influence o...,I have a great deal of influence,Bash/Shell/PowerShell;C++;C#;HTML/CSS;JavaScri...,C#;HTML/CSS;JavaScript;Objective-C;Ruby;SQL;Sw...,Microsoft SQL Server;MySQL;Redis;SQLite,Microsoft SQL Server;MySQL;Redis;SQLite,AWS;Docker;Linux;MacOS;Microsoft Azure;Windows...,Android;Docker;iOS;Linux;MacOS;Microsoft Azure...,Angular/Angular.js;ASP.NET;Drupal;Express;jQue...,Angular/Angular.js;ASP.NET,.NET;.NET Core;Node.js;Xamarin,.NET;.NET Core;Node.js,Notepad++;Sublime Text;Vim;Visual Studio;Xcode,MacOS,Development;Testing,Not at all,A passing fad,Yes,SIGH,Yes,I don't use social media,In real life (in person),Username,2008,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,,34.0,Man,No,Gay or Lesbian,,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78292,,No,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",United States,No,"Other doctoral degree (Ph.D, Ed.D., etc.)","A health science (ex. nursing, pharmacy, radio...",Completed an industry certification program (e...,"Just me - I am a freelancer, sole proprietor, ...",Academic researcher,42,14,31,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bash/Shell/PowerShell;C;Python,Bash/Shell/PowerShell;C;Python,SQLite,SQLite,Linux;Raspberry Pi;Other(s):,Linux;Raspberry Pi;Other(s):,,,Chef,,Emacs;IPython / Jupyter,Linux-based,I do not use containers,,Useful for immutable record keeping outside of...,No,Yes,Yes,I don't use social media,In real life (in person),,2013,A few times per week,Find answers to specific questions,Less than once per week,The other resource was slightly faster,11-30 minutes,Not sure / can't remember,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are","No, not really",Somewhat less welcome now than last year,,60.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Too long,Neither easy nor difficult
82717,,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,No,"Secondary school (e.g. American high school, G...",,,,,Less than 1 year,,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Android;Windows,Android;Microsoft Azure;Windows,,,,,,MacOS,Testing,,,No,SIGH,Yes,Facebook,In real life (in person),Username,2018,Less than once per month or monthly,Find answers to specific questions,Less than once per week,,60+ minutes,No,,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...",Not sure,,Industry news about technologies you're intere...,44.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Neither easy nor difficult
83397,,Yes,Less than once per year,,"Not employed, but looking for work",United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,,,12,9,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;JavaScript;Python;SQL,C;C++;C#;Go;Java;JavaScript;Python;R;Ruby;SQL;...,,,Android;Arduino;Slack,Android;Arduino;Docker;iOS;Raspberry Pi;Slack,Flask,Django;Drupal;Flask;jQuery;React.js,,Chef;Torch/PyTorch,Eclipse;IPython / Jupyter;Sublime Text,MacOS,I do not use containers,,,,SIGH,Yes,,,Handle,I don't remember,A few times per week,Find answers to specific questions;Learn how t...,3-5 times per week,They were about the same,,Not sure / can't remember,,Yes,"No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,,27.0,Woman,No,Bisexual,White or of European descent,No,Appropriate in length,Easy
85642,,No,Less than once per year,"OSS is, on average, of LOWER quality than prop...","Independent contractor, freelancer, or self-em...",United States,No,Associate degree,"Information systems, information technology, o...",Taken an online course in programming or softw...,"Just me - I am a freelancer, sole proprietor, ...",Designer;Marketing or sales professional,20,7,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,Go;HTML/CSS,,,,,,,,,,Visual Studio Code,Windows,I do not use containers,,Useful for immutable record keeping outside of...,No,SIGH,Yes,,In real life (in person),Handle,2008,Less than once per month or monthly,Find answers to specific questions,Less than once per week,Stack Overflow was slightly faster,60+ minutes,Yes,I have never participated in Q&A on Stack Over...,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,34.0,"Non-binary, genderqueer, or gender non-conforming",,Bisexual;Gay or Lesbian,White or of European descent,No,Appropriate in length,Easy
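What `get_group()` returns can be sketched on a small hypothetical frame (toy rows and a made-up `Respondent` index, not survey data). The sketch passes the grouping column as a plain string, which is also accepted for a single key, and compares the group with an ordinary boolean filter.

```python
import pandas as pd

# Hypothetical toy frame indexed by respondent id.
toy = pd.DataFrame(
    {
        "Country": ["United States", "India", "United States", "Germany"],
        "SocialMedia": ["Reddit", "WhatsApp", "Twitter", "YouTube"],
    },
    index=pd.Index([4, 7, 13, 20], name="Respondent"),
)

# Group on a single column; the key is given as a string here.
country_grp = toy.groupby("Country")

# The group names are the distinct values of the grouping column.
group_names = sorted(country_grp.groups)
print(group_names)

# One group is an ordinary data frame holding only the matching rows.
us_group = country_grp.get_group("United States")

# The same rows selected with a boolean filter.
us_filter = toy.loc[toy["Country"] == "United States"]

print(us_group.equals(us_filter))
```

The difference from a filter is that the grouped object holds every country's rows at once, so a function can later be applied to all groups in one call.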


If we run the code and retrieve the group for 'United States', we see that a data frame is displayed. Looking at the 'Country' column of that data frame, every value is 'United States'. That is what the 'DataFrameGroupBy' object holds: it has split the data frame into groups by country name. This is similar to running a filter on the whole data frame. For any given country we should get the same result if we created a filter and applied it:

In [32]:
filter_df = df['Country'] == 'United States'

In [33]:
df.loc[filter_df]

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,100 to 499 employees,"Developer, full-stack",3,16,Less than 1 year,Very satisfied,Slightly satisfied,Very confident,No,Not sure,I am not interested in new job opportunities,Less than a year ago,"Write code by hand (e.g., on a whiteboard);Int...",No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,61000.0,Yearly,61000.0,80.0,There's no schedule or spec; I work on what se...,,Less than once per month / Never,Home,A little below average,No,,"No, but I think we should",Developers typically have the most influence o...,I have little or no influence,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,,,.NET,.NET,Eclipse;Vim;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Daily or almost daily,Find answers to specific questions;Pass the ti...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,10 to 19 employees,Data or business analyst;Database administrato...,17,11,8,Very satisfied,Very satisfied,,,,I am not interested in new job opportunities,3-4 years ago,Complete a take-home project;Interview with pe...,Yes,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,90000.0,Yearly,90000.0,40.0,There is a schedule and/or spec (made by me or...,"Meetings;Non-work commitments (parenting, scho...",All or almost all the time (I'm full-time remote),Home,A little above average,"Yes, because I see value in code review",5.0,"No, but I think we should",Developers and management have nearly equal in...,I have a great deal of influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;...,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Rust...,Couchbase;DynamoDB;Firebase;MySQL,Firebase;MySQL;Redis,Android;AWS;Docker;IBM Cloud or Watson;iOS;Lin...,Android;AWS;Docker;IBM Cloud or Watson;Linux;S...,Angular/Angular.js;ASP.NET;Express;jQuery;Vue.js,Express;Vue.js,Node.js;Xamarin,Node.js;TensorFlow,Vim;Visual Studio;Visual Studio Code;Xcode,Windows,Development;Testing;Production,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,Yes,Yes,Twitter,In real life (in person),Username,2011,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
22,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,Some college/university study without earning ...,,Taken an online course in programming or softw...,"10,000 or more employees","Data or business analyst;Designer;Developer, b...",35,12,18,Slightly satisfied,Very dissatisfied,Somewhat confident,No,No,"I’m not actively looking, but I am open to new...",More than 4 years ago,Interview with people in senior / management r...,No,Industry that I'd be working in;Financial perf...,I had a negative experience or interaction at ...,USD,United States dollar,103000.0,Yearly,103000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Average,No,,"No, but I think we should","The CTO, CIO, or other management purchase new...",I have little or no influence,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Elasticsearch;MySQL;Oracle;Redis,Elasticsearch;MySQL;Oracle;Redis,Docker;Linux;Raspberry Pi;Windows,Docker;Linux;Raspberry Pi;Windows,Angular/Angular.js;Ruby on Rails,Angular/Angular.js;Ruby on Rails,Node.js,Node.js,Sublime Text;Visual Studio;Visual Studio Code,Windows,"Outside of work, for personal projects",Not at all,,Yes,Yes,Yes,Instagram,Online,Username,I don't remember,Daily or almost daily,Find answers to specific questions,3-5 times per week,Stack Overflow was much faster,0-10 minutes,Yes,A few times per week,Yes,"No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,47.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
23,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",Taken an online course in programming or softw...,"10,000 or more employees","Developer, full-stack",3,19,1,Slightly satisfied,Slightly satisfied,Very confident,No,Not sure,"I’m not actively looking, but I am open to new...",Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,Opportunities for professional development;How...,I was preparing for a job search,USD,United States dollar,69000.0,Yearly,69000.0,40.0,There is a schedule and/or spec (made by me or...,Distracting work environment;Meetings;Non-work...,A few days each month,Office,Average,"Yes, because I see value in code review",8.0,"Yes, it's part of our process",Developers and management have nearly equal in...,I have little or no influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Pyth...,Bash/Shell/PowerShell;Go;HTML/CSS;Java;JavaScr...,Oracle;SQLite,Couchbase;DynamoDB;Elasticsearch;Firebase;Oracle,Docker;Google Cloud Platform,Docker;iOS;Slack,React.js;Ruby on Rails,Express;React.js;Ruby on Rails;Vue.js,,React Native;TensorFlow,Visual Studio Code,MacOS,Development;Testing;Production,,Useful for immutable record keeping outside of...,Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Multiple times per day,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,I have never participated in Q&A on Stack Over...,Yes,"No, I've heard of them, but I am not part of a...","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,22.0,Man,No,Straight / Heterosexual,Black or of African descent,No,Appropriate in length,Easy
26,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...","10,000 or more employees","Designer;Developer, back-end;Developer, deskto...",12,8,8,Very satisfied,Very satisfied,,,,"I’m not actively looking, but I am open to new...",Less than a year ago,Interview with people in peer roles;Interview ...,No,Remote work options;Diversity of the company o...,I was preparing for a job search,USD,United States dollar,114000.0,Yearly,114000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Far above average,"Yes, because I see value in code review",2.0,"Yes, it's not part of our process but the deve...",Developers typically have the most influence o...,I have a great deal of influence,Bash/Shell/PowerShell;C++;C#;HTML/CSS;JavaScri...,C#;HTML/CSS;JavaScript;Objective-C;Ruby;SQL;Sw...,Microsoft SQL Server;MySQL;Redis;SQLite,Microsoft SQL Server;MySQL;Redis;SQLite,AWS;Docker;Linux;MacOS;Microsoft Azure;Windows...,Android;Docker;iOS;Linux;MacOS;Microsoft Azure...,Angular/Angular.js;ASP.NET;Drupal;Express;jQue...,Angular/Angular.js;ASP.NET,.NET;.NET Core;Node.js;Xamarin,.NET;.NET Core;Node.js,Notepad++;Sublime Text;Vim;Visual Studio;Xcode,MacOS,Development;Testing,Not at all,A passing fad,Yes,SIGH,Yes,I don't use social media,In real life (in person),Username,2008,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,,34.0,Man,No,Gay or Lesbian,,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78292,,No,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",United States,No,"Other doctoral degree (Ph.D, Ed.D., etc.)","A health science (ex. nursing, pharmacy, radio...",Completed an industry certification program (e...,"Just me - I am a freelancer, sole proprietor, ...",Academic researcher,42,14,31,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bash/Shell/PowerShell;C;Python,Bash/Shell/PowerShell;C;Python,SQLite,SQLite,Linux;Raspberry Pi;Other(s):,Linux;Raspberry Pi;Other(s):,,,Chef,,Emacs;IPython / Jupyter,Linux-based,I do not use containers,,Useful for immutable record keeping outside of...,No,Yes,Yes,I don't use social media,In real life (in person),,2013,A few times per week,Find answers to specific questions,Less than once per week,The other resource was slightly faster,11-30 minutes,Not sure / can't remember,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are","No, not really",Somewhat less welcome now than last year,,60.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Too long,Neither easy nor difficult
82717,,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,No,"Secondary school (e.g. American high school, G...",,,,,Less than 1 year,,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Android;Windows,Android;Microsoft Azure;Windows,,,,,,MacOS,Testing,,,No,SIGH,Yes,Facebook,In real life (in person),Username,2018,Less than once per month or monthly,Find answers to specific questions,Less than once per week,,60+ minutes,No,,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...",Not sure,,Industry news about technologies you're intere...,44.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Neither easy nor difficult
83397,,Yes,Less than once per year,,"Not employed, but looking for work",United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,,,12,9,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;JavaScript;Python;SQL,C;C++;C#;Go;Java;JavaScript;Python;R;Ruby;SQL;...,,,Android;Arduino;Slack,Android;Arduino;Docker;iOS;Raspberry Pi;Slack,Flask,Django;Drupal;Flask;jQuery;React.js,,Chef;Torch/PyTorch,Eclipse;IPython / Jupyter;Sublime Text,MacOS,I do not use containers,,,,SIGH,Yes,,,Handle,I don't remember,A few times per week,Find answers to specific questions;Learn how t...,3-5 times per week,They were about the same,,Not sure / can't remember,,Yes,"No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,,27.0,Woman,No,Bisexual,White or of European descent,No,Appropriate in length,Easy
85642,,No,Less than once per year,"OSS is, on average, of LOWER quality than prop...","Independent contractor, freelancer, or self-em...",United States,No,Associate degree,"Information systems, information technology, o...",Taken an online course in programming or softw...,"Just me - I am a freelancer, sole proprietor, ...",Designer;Marketing or sales professional,20,7,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,Go;HTML/CSS,,,,,,,,,,Visual Studio Code,Windows,I do not use containers,,Useful for immutable record keeping outside of...,No,SIGH,Yes,,In real life (in person),Handle,2008,Less than once per month or monthly,Find answers to specific questions,Less than once per week,Stack Overflow was slightly faster,60+ minutes,Yes,I have never participated in Q&A on Stack Over...,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,34.0,"Non-binary, genderqueer, or gender non-conforming",,Bisexual;Gay or Lesbian,White or of European descent,No,Appropriate in length,Easy


If we only want information about one particular country, grouping is very similar to filtering, since it returns the same rows. The difference is that, instead of returning the results for a single country, the 'groupby()' method splits all the responses by country name. This is the first step of grouping: the split. Once the responses are split, we can move on to step two, which is applying a function.
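
To make the comparison between filtering and the split step concrete, here is a minimal sketch. It assumes a tiny, made-up data frame (`toy_df`, with invented values) rather than the real survey file, and it uses `get_group()` to pull a single group out of the grouped object.

```python
import pandas as pd

# Hypothetical miniature of the survey data (values invented for illustration)
toy_df = pd.DataFrame(
    {
        "Country": ["United States", "India", "United States", "Germany", "India"],
        "SocialMedia": ["Reddit", "WhatsApp", "Twitter", "Reddit", "YouTube"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Respondent"),
)

# Filtering: keep only the rows for one country
us_filter = toy_df["Country"] == "United States"
filtered_us = toy_df.loc[us_filter]

# Grouping: split every row by country, then pull out one of the groups
country_groups = toy_df.groupby("Country")
grouped_us = country_groups.get_group("United States")

# Both approaches give the same rows for that country
same_rows = filtered_us.equals(grouped_us)
group_names = sorted(country_groups.groups)

print(same_rows)
print(group_names)
```

The filter answers the question for one country at a time, while the grouped object already holds one such subset for every country.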

If, for example, we want to find the most popular social media site in the United States, we can start by using the filter from above. On the filtered rows we select the 'SocialMedia' column and call the 'value_counts()' method; passing normalize=True returns proportions instead of raw counts

In [35]:
df.loc[filter_df]['SocialMedia'].value_counts(normalize=True)

Reddit                      0.284346
Twitter                     0.173002
Facebook                    0.141874
YouTube                     0.122867
I don't use social media    0.092338
Instagram                   0.082410
LinkedIn                    0.050883
WhatsApp                    0.030380
Snapchat                    0.016263
WeChat 微信                   0.004639
VK ВКонта́кте               0.000449
Weibo 新浪微博                  0.000399
Hello                       0.000100
Youku Tudou 优酷              0.000050
Name: SocialMedia, dtype: float64

These are the values for one country only. If we call 'value_counts()' on the grouped data instead, it computes these values for each group separately. The 'DataFrameGroupBy' object has all the columns of the full data frame. If we select a single column from it, the output only tells us that we have a 'SeriesGroupBy' object

In [36]:
groupby_country_df['SocialMedia']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7faf5088bc70>

To see the 'SocialMedia' data for each group, we can call the 'value_counts()' method on that column

In [37]:
groupby_country_df['SocialMedia'].value_counts()

Country      SocialMedia             
Afghanistan  Facebook                    15
             YouTube                      9
             I don't use social media     6
             WhatsApp                     4
             Instagram                    1
                                         ..
Zimbabwe     Facebook                     3
             YouTube                      3
             Instagram                    2
             LinkedIn                     2
             Reddit                       1
Name: SocialMedia, Length: 1220, dtype: int64

Because we applied the method to the grouped data, it returns a Series that holds, for each country separately, the number of responses for each value of the 'SocialMedia' column

In [38]:
groupby_country_df['SocialMedia'].value_counts().head(50)

Country              SocialMedia             
Afghanistan          Facebook                     15
                     YouTube                       9
                     I don't use social media      6
                     WhatsApp                      4
                     Instagram                     1
                     LinkedIn                      1
                     Twitter                       1
Albania              WhatsApp                     18
                     Facebook                     16
                     Instagram                    13
                     YouTube                      10
                     Twitter                       8
                     LinkedIn                      7
                     Reddit                        6
                     I don't use social media      4
                     Snapchat                      1
                     WeChat 微信                     1
Algeria              YouTube                      42


As mentioned, what is returned is a Series with a multi-level index whose levels are 'Country' and 'SocialMedia'. The first level is 'Country', and because it is an index we can select a single country with the 'loc' indexer. The indexer must be applied after the 'value_counts()' method has run

In [39]:
groupby_country_df['SocialMedia'].value_counts().loc['United States']

SocialMedia
Reddit                      5700
Twitter                     3468
Facebook                    2844
YouTube                     2463
I don't use social media    1851
Instagram                   1652
LinkedIn                    1020
WhatsApp                     609
Snapchat                     326
WeChat 微信                     93
VK ВКонта́кте                  9
Weibo 新浪微博                     8
Hello                          2
Youku Tudou 优酷                 1
Name: SocialMedia, dtype: int64

In [40]:
groupby_country_df['SocialMedia'].value_counts().loc['India']

SocialMedia
WhatsApp                    2990
YouTube                     1820
LinkedIn                     955
Facebook                     841
Instagram                    822
Twitter                      542
Reddit                       473
I don't use social media     250
Snapchat                      23
Hello                          5
WeChat 微信                      5
VK ВКонта́кте                  4
Youku Tudou 优酷                 2
Weibo 新浪微博                     1
Name: SocialMedia, dtype: int64

If we want data about other countries quickly, it is enough to change the index label we extract. This is what makes grouping the data so useful
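
The per-country lookup can be sketched on a small scale as follows. The data frame here is hypothetical (invented values, not the survey data); the calls themselves, `value_counts()` on a grouped column and `loc` on the first index level, are the same ones used in the cells above.

```python
import pandas as pd

# Hypothetical responses, invented for illustration
toy_df = pd.DataFrame(
    {
        "Country": ["India", "India", "India", "United States", "United States"],
        "SocialMedia": ["WhatsApp", "WhatsApp", "YouTube", "Reddit", "Twitter"],
    }
)

grouped_social = toy_df.groupby("Country")["SocialMedia"]

# Counts per country: a Series indexed by (Country, SocialMedia)
counts = grouped_social.value_counts()
index_levels = list(counts.index.names)

# Select one country from the first index level
india_counts = counts.loc["India"]

# normalize=True gives the share within each country instead of raw counts
india_share = grouped_social.value_counts(normalize=True).loc["India"]

print(index_levels)
print(india_counts)
print(india_share)
```

Switching to another country only requires changing the label passed to `loc`.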

At this point we can run aggregation methods over these groups (such as 'median()' or 'mean()'). To see the median salary for each country in the survey, we select the 'ConvertedComp' column and call the 'median()' method on it

In [41]:
groupby_country_df['ConvertedComp'].median()

Country
Afghanistan                               6222.0
Albania                                  10818.0
Algeria                                   7878.0
Andorra                                 160931.0
Angola                                    7764.0
                                          ...   
Venezuela, Bolivarian Republic of...      6384.0
Viet Nam                                 11892.0
Yemen                                    11940.0
Zambia                                    5040.0
Zimbabwe                                 19200.0
Name: ConvertedComp, Length: 179, dtype: float64

The result is a Series whose index holds the country names. To get the median salary for a particular country, we can use the 'loc' indexer with that country's name

In [42]:
groupby_country_df['ConvertedComp'].median().loc['Germany']

63016.0
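
As a small-scale illustration of the same pattern, the sketch below groups a hypothetical data frame (invented salaries, including one missing answer) and takes the median per country. By default pandas skips NaN values when computing the median, so respondents who did not answer do not affect the result.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, invented for illustration; NaN marks a missing answer
toy_df = pd.DataFrame(
    {
        "Country": ["Germany", "Germany", "Germany", "India", "India"],
        "ConvertedComp": [60000.0, np.nan, 70000.0, 10000.0, 14000.0],
    }
)

# One median per country; the NaN in the Germany group is ignored
median_by_country = toy_df.groupby("Country")["ConvertedComp"].median()

# Pick a single country out of the result by its index label
germany_median = median_by_country.loc["Germany"]

print(median_by_country)
print(germany_median)
```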

Now let's look at the case where we are doing some analysis and want to group the data, but also want to run several aggregation functions on those groups. Say we want to call both 'median()' and 'mean()'. To do this we can use the `agg()` method and pass it the functions we want to apply. The functions are given as a list of strings containing their names

In [43]:
groupby_country_df['ConvertedComp'].agg(['median', 'mean'])

Country,median,mean
Afghanistan,6222.0,101953.333333
Albania,10818.0,21833.700000
Algeria,7878.0,34924.047619
Andorra,160931.0,160931.000000
Angola,7764.0,7764.000000
...,...,...
"Venezuela, Bolivarian Republic of...",6384.0,14581.627907
Viet Nam,11892.0,17233.436782
Yemen,11940.0,16909.166667
Zambia,5040.0,10075.375000
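
The table above shows why it can be worth asking for both statistics: for Afghanistan the mean is far larger than the median, which suggests a few very large values pulling the mean up. A minimal sketch of that effect, using a hypothetical data frame with invented salaries and placeholder country names "A" and "B":

```python
import pandas as pd

# Hypothetical salaries, invented for illustration; country "A" has one extreme value
toy_df = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B"],
        "ConvertedComp": [5000.0, 6000.0, 1000000.0, 40000.0, 60000.0],
    }
)

# Passing a list of function names returns one column per aggregation
summary = toy_df.groupby("Country")["ConvertedComp"].agg(["median", "mean"])

print(summary)
```

For country "A" the single extreme salary barely moves the median but dominates the mean, while for country "B" the two statistics agree.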


Depending on what we are trying to do, we may run into errors we did not expect. Take the case where we want to find out how many people in each country know Python as a programming language. Before looking at how to do this for a group, we will do it for a single country using filtering.

In [44]:
filter_df = df['Country'] == 'India'

In [45]:
df.loc[filter_df]

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
8,I code primarily as a hobby,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work",India,,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...",,"Developer, back-end;Engineer, site reliability",8,16,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;Java...,Bash/Shell/PowerShell;C;C++;Elixir;Erlang;Go;P...,Cassandra;Elasticsearch;MongoDB;MySQL;Oracle;R...,Cassandra;DynamoDB;Elasticsearch;Firebase;Mong...,AWS;Docker;Heroku;Linux;MacOS;Slack,Android;Arduino;AWS;Docker;Google Cloud Platfo...,Express;Flask;React.js;Spring,Django;Express;Flask;React.js;Vue.js,Hadoop;Node.js;Pandas,Ansible;Apache Spark;Chef;Hadoop;Node.js;Panda...,Atom;IntelliJ;IPython / Jupyter;PyCharm;Visual...,Linux-based,Development;Testing;Production;Outside of work...,,Useful across many domains and could change ma...,Yes,SIGH,Yes,YouTube,In real life (in person),Handle,2012,A few times per week,Find answers to specific questions;Learn how t...,Less than once per week,Stack Overflow was slightly faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","Yes, definitely",A lot more welcome now than last year,Tech articles written by other developers;Indu...,24.0,Man,No,Straight / Heterosexual,,,Appropriate in length,Neither easy nor difficult
10,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",,,"10,000 or more employees",Data or business analyst;Data scientist or mac...,12,20,10,Slightly dissatisfied,Slightly dissatisfied,Somewhat confident,Yes,Yes,"I’m not actively looking, but I am open to new...",3-4 years ago,,No,"Languages, frameworks, and other technologies ...",,INR,Indian rupee,950000.0,Yearly,13293.0,70.0,There's no schedule or spec; I work on what se...,,A few days each month,Home,Far above average,"Yes, because I see value in code review",4.0,"Yes, it's part of our process",,,C#;Go;JavaScript;Python;R;SQL,C#;Go;JavaScript;Kotlin;Python;R;SQL,Elasticsearch;MongoDB;Microsoft SQL Server;MyS...,Elasticsearch;MongoDB;Microsoft SQL Server,Linux;Windows,Android;Linux;Raspberry Pi;Windows,Angular/Angular.js;ASP.NET;Django;Express;Flas...,Angular/Angular.js;ASP.NET;Django;Express;Flas...,.NET;Node.js;Pandas;Torch/PyTorch,.NET;Node.js;TensorFlow;Torch/PyTorch,Android Studio;Eclipse;IPython / Jupyter;Notep...,Windows,,Not at all,Useful for immutable record keeping outside of...,No,Yes,Yes,YouTube,Neither,Screen Name,,Multiple times per day,Find answers to specific questions;Get a sense...,3-5 times per week,They were about the same,,Yes,A few times per month or weekly,Yes,"No, and I don't know what those are","Yes, somewhat",Somewhat less welcome now than last year,Tech articles written by other developers;Tech...,,,,,,Yes,Too long,Difficult
15,I am a student who is learning to code,Yes,Never,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work",India,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,,Student,3,13,,,,,,,"I’m not actively looking, but I am open to new...",I've never had a job,,,"Industry that I'd be working in;Languages, fra...","Something else changed (education, award, medi...",,,,,,,,,,,,,,,,,Assembly;Bash/Shell/PowerShell;C;C++;HTML/CSS;...,Assembly;Bash/Shell/PowerShell;C;C++;C#;Go;HTM...,MariaDB;MySQL;Oracle;SQLite,MariaDB;MongoDB;Microsoft SQL Server;MySQL;Ora...,Linux;Windows,Android;Google Cloud Platform;iOS;Linux;MacOS;...,,Angular/Angular.js;ASP.NET;Django;Drupal;jQuer...,,.NET;.NET Core;Node.js;TensorFlow;Unity 3D;Unr...,Atom;NetBeans;Notepad++;Sublime Text;Vim,Linux-based,Development,,,Yes,Yes,What?,YouTube,In real life (in person),,2018,Daily or almost daily,Find answers to specific questions;Learn how t...,More than 10 times per week,They were about the same,,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,20.0,Man,No,,,Yes,Too long,Neither easy nor difficult
50,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of LOWER quality than prop...",Employed full-time,India,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Another engineering discipline (ex. civil, ele...",Received on-the-job training in software devel...,"10,000 or more employees","Developer, back-end;DevOps specialist",7,15,2,Slightly satisfied,Very satisfied,Very confident,Not sure,Yes,"I’m not actively looking, but I am open to new...",1-2 years ago,"Write code by hand (e.g., on a whiteboard);Int...",No,Specific department or team I'd be working on;...,I was preparing for a job search,INR,Indian rupee,400000.0,Yearly,5597.0,7.0,There is a schedule and/or spec (made by me or...,Meetings;Time spent commuting,Less than once per month / Never,"Other place, such as a coworking space or cafe",Average,No,,"Yes, it's not part of our process but the deve...","The CTO, CIO, or other management purchase new...",I have little or no influence,Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;Java...,HTML/CSS;JavaScript;Python,Elasticsearch;Firebase;MariaDB;MongoDB;MySQL;O...,Firebase;PostgreSQL;Redis;Other(s):,Arduino;AWS;Heroku;Linux;MacOS;Raspberry Pi;Wo...,AWS;Docker;Heroku;Kubernetes;Linux;MacOS;WordP...,Django;Express;Flask;jQuery,Express;Flask;jQuery;React.js;Vue.js,Node.js,Node.js,Notepad++;Visual Studio Code,MacOS,Testing,Not at all,Useful for immutable record keeping outside of...,Yes,Also Yes,What?,YouTube,In real life (in person),Username,2012,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was slightly faster,11-30 minutes,Yes,Less than once per month or monthly,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","Yes, definitely",Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,23.0,Man,No,,South Asian,No,Too long,Easy
65,I am a developer by profession,Yes,Never,,Employed full-time,India,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",,20 to 99 employees,"Developer, front-end;Developer, mobile",2,17,2,Very satisfied,Very satisfied,Very confident,No,Not sure,"I’m not actively looking, but I am open to new...",Less than a year ago,Write any code;Solve a brain-teaser style puzz...,No,"Languages, frameworks, and other technologies ...","My job status changed (promotion, new job, etc.)",INR,Indian rupee,,Monthly,,48.0,There's no schedule or spec; I work on what se...,,About half the time,Office,Average,"Yes, because I see value in code review",,"Yes, it's not part of our process but the deve...",Not sure,,Assembly;C;C++;C#;HTML/CSS;Java,Kotlin,Firebase;MySQL;Oracle;SQLite,Firebase;SQLite,Android,Android,ASP.NET,,,,Android Studio;IntelliJ,Linux-based,,,,Yes,Yes,What?,WhatsApp,In real life (in person),,2017,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was slightly faster,11-30 minutes,Yes,A few times per week,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are",Not sure,A lot more welcome now than last year,,21.0,Man,No,,,Yes,Appropriate in length,Neither easy nor difficult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77339,,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,"Yes, full-time","Bachelor’s degree (BA, BS, B.Eng., etc.)","Another engineering discipline (ex. civil, ele...",Taken an online course in programming or softw...,"1,000 to 4,999 employees",,1,27,1,,,Somewhat confident,Yes,Yes,,,,,,,,,,,,,,,,,,,,,,,Other(s):,Python;SQL,,,,,,,,,,Linux-based,I do not use containers,,,Yes,Yes,No,YouTube,Online,UserID,2019,Less than once per month or monthly,Find answers to specific questions;Learn how t...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,I have never participated in Q&A on Stack Over...,"No, I didn't know that Stack Overflow had a jo...","No, I've heard of them, but I am not part of a...","Yes, somewhat",Not applicable - I did not use Stack Overflow ...,Tech articles written by other developers;Indu...,,,,,,,,
79795,,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,500 to 999 employees,"Developer, QA or test",6,17,5,,,Somewhat confident,No,Not sure,,,,,,,,,,,,,,,,,,,,,,,Bash/Shell/PowerShell;Python;SQL;VBA,,,,,,,,Apache Spark;Chef;Puppet,,PyCharm;Vim,Linux-based,Development;Testing;Production;Outside of work...,Not at all,,No,Yes,What?,Instagram,In real life (in person),Username,2018,A few times per month or weekly,Find answers to specific questions,Less than once per week,They were about the same,,Yes,I have never participated in Q&A on Stack Over...,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are",Neutral,Somewhat more welcome now than last year,Tech meetups or events in your area;Courses on...,,Man,No,Straight / Heterosexual,,No,Too long,Difficult
83862,,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,"Yes, full-time","Bachelor’s degree (BA, BS, B.Eng., etc.)",,Participated in a hackathon,,Data or business analyst;Student,1,18,Less than 1 year,,,Very confident,Not sure,Yes,,,,,,,,,,,,,,,,,,,,,,,Assembly;C;C++;HTML/CSS;Java;JavaScript;Object...,,MySQL,DynamoDB;Elasticsearch;MongoDB,Android;AWS;Google Cloud Platform;WordPress,IBM Cloud or Watson,Laravel,Angular/Angular.js;Laravel;Vue.js,,Node.js,Android Studio;Atom;IntelliJ;Komodo;NetBeans;N...,Windows,I do not use containers,,Useful across many domains and could change ma...,No,Yes,What?,Twitter,Online,UserID,2012,A few times per month or weekly,Find answers to specific questions;Learn how t...,Less than once per week,Stack Overflow was much faster,0-10 minutes,Yes,I have never participated in Q&A on Stack Over...,Yes,"No, and I don't know what those are","Yes, definitely",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,24.0,Man,No,Straight / Heterosexual,,Yes,Too long,Neither easy nor difficult
84299,,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,India,"Yes, full-time","Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,Taken an online course in programming or softw...,100 to 499 employees,"Developer, back-end;Developer, front-end;Devel...",12,25,12,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;Java;JavaScript;Python;Swift;TypeScript,HTML/CSS;Java;JavaScript,MongoDB;Redis;SQLite,MongoDB;Redis,,,Angular/Angular.js;Express;jQuery;React.js;Oth...,Express;React.js;Other(s):,Node.js,Node.js,IntelliJ;Notepad++;Visual Studio Code;Xcode,Windows,,,,Yes,"Fortunately, someone else has that title",What?,LinkedIn,,,2011,A few times per month or weekly,Find answers to specific questions;Contribute ...,Less than once per week,Stack Overflow was much faster,60+ minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","Yes, somewhat",Somewhat more welcome now than last year,,,,,,,,,


To see how many people know Python, we will use the string methods that we have used repeatedly throughout this tutorial. The relevant data lives in the 'LanguageWorkedWith' column. From the data frame produced by the filter we can extract just that column as a Series.

In [49]:
df.loc[filter_df]['LanguageWorkedWith']

Respondent
8        Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;Java...
10                           C#;Go;JavaScript;Python;R;SQL
15       Assembly;Bash/Shell/PowerShell;C;C++;HTML/CSS;...
50       Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;Java...
65                         Assembly;C;C++;C#;HTML/CSS;Java
                               ...                        
77339                                            Other(s):
79795                 Bash/Shell/PowerShell;Python;SQL;VBA
83862    Assembly;C;C++;HTML/CSS;Java;JavaScript;Object...
84299     HTML/CSS;Java;JavaScript;Python;Swift;TypeScript
86012        Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript
Name: LanguageWorkedWith, Length: 9061, dtype: object

Now, on that Series we can use the `str` accessor and call its `contains()` method to check whether each string contains 'Python' or not.

In [48]:
df.loc[filter_df]['LanguageWorkedWith'].str.contains('Python')

Respondent
8         True
10        True
15       False
50        True
65       False
         ...  
77339    False
79795     True
83862    False
84299     True
86012    False
Name: LanguageWorkedWith, Length: 9061, dtype: object
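
The `dtype: object` in the output above suggests that some respondents left this question blank. As a minimal sketch on hypothetical toy data (the three answers below are invented for illustration), the `na=False` argument of `contains()` treats a missing answer as "does not contain", which yields a plain boolean result. The sketch also shows that `contains()` matches substrings, so searching for 'Java' also matches 'JavaScript'.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'LanguageWorkedWith' column.
languages = pd.Series(
    ["C;Python;SQL", "HTML/CSS;JavaScript", np.nan],
    index=[1, 2, 3],
    dtype=object,
)

# na=False treats a missing answer as "does not contain Python".
knows_python = languages.str.contains("Python", na=False)
print(knows_python)

# contains() does substring matching: 'Java' also matches 'JavaScript'.
knows_java_substring = languages.str.contains("Java", na=False)

# Splitting on ';' first compares whole language names instead.
knows_java_exact = languages.str.split(";").apply(
    lambda langs: isinstance(langs, list) and "Java" in langs
)
print(knows_java_substring)
print(knows_java_exact)
```

For 'Python' the substring match is harmless, but for names that are prefixes of other names the split-based check is the safer choice.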

To get the total number of people who know Python we can use the `sum()` method. We might assume that this method only works on integer or float values, but it also works on booleans, where every True counts as 1 and every False as 0. We can call it on the code above, which returns those boolean values.

In [50]:
df.loc[filter_df]['LanguageWorkedWith'].str.contains('Python').sum()

3105
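
A small sketch of why summing booleans works as a count, using an invented four-element Series: True is treated as 1 and False as 0, so `sum()` counts the True values and `mean()` gives their share.

```python
import pandas as pd

# Hypothetical boolean answers, e.g. the result of str.contains().
flags = pd.Series([True, False, True, False])

total_true = flags.sum()   # True counts as 1, False as 0
share_true = flags.mean()  # fraction of True values

print(total_true, share_true)
```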

Earlier, when we wanted to call a similar aggregation function on the 'DataFrameGroupBy' object, we simply used the same approach on that object. So we might think that instead of the filter (df.loc[filter_df]) we can use the group object, keep the rest of the code unchanged, and get the result we want.

In [51]:
groupby_country_df['LanguageWorkedWith'].str.contains('Python').sum()

AttributeError: 'SeriesGroupBy' object has no attribute 'str'

Running that code raises an error. The error tells us that the 'SeriesGroupBy' object does not have a 'str' attribute. This object holds a group of Series objects. To run a function on each Series in it we have to use the 'apply()' method and tell it which function to call for every Series in the group. That function can be defined beforehand, or we can use a lambda function.

In [52]:
groupby_country_df['LanguageWorkedWith'].apply(lambda x : x.str.contains('Python').sum())

Country
Afghanistan                              8
Albania                                 23
Algeria                                 40
Andorra                                  0
Angola                                   2
                                        ..
Venezuela, Bolivarian Republic of...    28
Viet Nam                                78
Yemen                                    3
Zambia                                   4
Zimbabwe                                14
Name: LanguageWorkedWith, Length: 179, dtype: int64
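
The text above mentions that the function passed to `apply()` can also be defined beforehand. A minimal sketch of that variant, on a hypothetical toy data frame (the rows and the helper name `count_python` are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the survey.
toy = pd.DataFrame({
    "Country": ["India", "India", "Canada", "Canada", "Canada"],
    "LanguageWorkedWith": ["Python;SQL", "Java", "Python", "C;Python", None],
})

def count_python(languages):
    """Count the entries of one group's Series that mention Python."""
    return languages.str.contains("Python", na=False).sum()

# apply() calls count_python once per country, passing that country's Series.
python_per_country = toy.groupby("Country")["LanguageWorkedWith"].apply(count_python)
print(python_per_country)
```

A named function is easier to reuse and to test than a lambda, at the cost of a few more lines.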

Running that code gives us, for each country, the number of respondents who said in this survey that they know 'Python' as a programming language. These raw counts alone do not help much if we want an idea of the percentage of people in each country who know Python. To compute it we will combine several methods we have learned in this tutorial. There is most likely more than one way to obtain that percentage.

To begin, we extract the total number of survey respondents from each country

In [53]:
country_respondents = df['Country'].value_counts()

In [54]:
country_respondents

United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: Country, Length: 179, dtype: int64

The next step is to store the number of people from each country who know Python.

In [55]:
country_respondents_use_python = groupby_country_df['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())

In [56]:
country_respondents_use_python

Country
Afghanistan                              8
Albania                                 23
Algeria                                 40
Andorra                                  0
Angola                                   2
                                        ..
Venezuela, Bolivarian Republic of...    28
Viet Nam                                78
Yemen                                    3
Zambia                                   4
Zimbabwe                                14
Name: LanguageWorkedWith, Length: 179, dtype: int64

At this point we have two Series objects: one holds the number of respondents from each country, and the other holds the number of people from each country who know Python. Next we combine these two Series. To do that we will use a function we have not used so far: `concat()`, which can build a data frame out of several Series objects. It is called on the pandas library itself (`pd.concat`), and its argument is a list of the Series objects we want to concatenate.

In [None]:
python_df = pd.concat([country_respondents, country_respondents_use_python])

The code above needs one more argument: the axis along which to concatenate. By default the axis is 0 (rows), so the Series would be stacked one under the other. We want to match the values of one Series with the other by index, so we have to change the axis to columns. This is done with the `axis=` argument.

In [57]:
python_df = pd.concat([country_respondents, country_respondents_use_python], axis='columns')

In [58]:
python_df

Unnamed: 0,Country,LanguageWorkedWith
United States,20949,10083
India,9061,3105
Germany,5866,2451
United Kingdom,5737,2384
Canada,3395,1558
...,...,...
Tonga,1,0
Timor-Leste,1,1
North Korea,1,0
Brunei Darussalam,1,0
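
To make the effect of `axis=` concrete, here is a small sketch with two invented Series (the names and values are illustrative only). With the default axis the values are stacked into one longer Series, while `axis='columns'` aligns them on the index, even when the index order differs.

```python
import pandas as pd

# Two hypothetical Series indexed by country, in different orders.
respondents = pd.Series({"India": 3, "Canada": 2}, name="Respondents")
knows_python = pd.Series({"Canada": 1, "India": 2}, name="KnowsPython")

# Default axis=0: the second Series is appended below the first.
stacked = pd.concat([respondents, knows_python])

# axis='columns': values are matched by index label, one column per Series.
side_by_side = pd.concat([respondents, knows_python], axis="columns")

print(stacked)
print(side_by_side)
```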


We now have a data frame created by concatenating two Series objects, and the concatenation was done on the index, that is, on the country name. The two columns of this data frame have names that do not describe the data they hold. We will rename them so that they state clearly what each column contains.

In [60]:
python_df.rename(columns={'Country' : 'NumRespondents', 'LanguageWorkedWith' : 'NumKnowsPython'}, inplace=True)

In [61]:
python_df

Unnamed: 0,NumRespondents,NumKnowsPython
United States,20949,10083
India,9061,3105
Germany,5866,2451
United Kingdom,5737,2384
Canada,3395,1558
...,...,...
Tonga,1,0
Timor-Leste,1,1
North Korea,1,0
Brunei Darussalam,1,0
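
Note that in pandas 2.0 and later the Series returned by `value_counts()` is named 'count' rather than after the original column, so a rename mapping keyed on 'Country' may not match anything. One way to avoid depending on the Series names, sketched here on hypothetical toy data, is to pass the desired column names through the `keys=` argument of `pd.concat`.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the survey.
toy = pd.DataFrame({
    "Country": ["India", "India", "Canada"],
    "LanguageWorkedWith": ["Python;SQL", "Java", "Python"],
})

respondents = toy["Country"].value_counts()
knows_python = toy.groupby("Country")["LanguageWorkedWith"].apply(
    lambda x: x.str.contains("Python", na=False).sum()
)

# keys= supplies the column names directly, whatever the Series are called.
python_df = pd.concat(
    [respondents, knows_python],
    axis="columns",
    keys=["NumRespondents", "NumKnowsPython"],
)
print(python_df)
```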


We now have the two columns of interest; all that remains is to create a new column with the percentage for each country. To compute it we divide the number of people who know Python by the number of respondents and multiply by 100, then store the result in a new column.

In [62]:
python_df['PercKnowsPython'] = (python_df['NumKnowsPython'] / python_df['NumRespondents']) * 100

In [63]:
python_df

Unnamed: 0,NumRespondents,NumKnowsPython,PercKnowsPython
United States,20949,10083,48.131176
India,9061,3105,34.267741
Germany,5866,2451,41.783157
United Kingdom,5737,2384,41.554820
Canada,3395,1558,45.891016
...,...,...,...
Tonga,1,0,0.000000
Timor-Leste,1,1,100.000000
North Korea,1,0,0.000000
Brunei Darussalam,1,0,0.000000
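
As one possible alternative to building the two Series and concatenating them, the percentage can also be sketched in a single step: the mean of a boolean Series is the share of True values, so grouping the boolean "knows Python" flags by country and taking the mean gives the same ratio. The data below is a hypothetical toy sample, and a missing answer is counted as not knowing Python.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the survey.
toy = pd.DataFrame({
    "Country": ["India"] * 4 + ["Canada"] * 4,
    "LanguageWorkedWith": [
        "Python;SQL", "Java", "C", "Go",
        "Python", "Python;C", "Java", None,
    ],
})

# Boolean flag per respondent; missing answers count as False.
knows_python = toy["LanguageWorkedWith"].str.contains("Python", na=False)

# Mean of the flags per country is the share of True values; scale to percent.
perc_knows_python = knows_python.groupby(toy["Country"]).mean() * 100
print(perc_knows_python)
```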


From here on, this column can be used like any other data frame column. For example, we can sort its values or apply aggregation methods to it.

In [64]:
python_df.sort_values(by='PercKnowsPython', ascending=False)

Unnamed: 0,NumRespondents,NumKnowsPython,PercKnowsPython
Sao Tome and Principe,1,1,100.000000
Timor-Leste,1,1,100.000000
Dominica,1,1,100.000000
Niger,1,1,100.000000
Turkmenistan,7,6,85.714286
...,...,...,...
Cape Verde,3,0,0.000000
Lao People's Democratic Republic,3,0,0.000000
Malawi,2,0,0.000000
Liberia,2,0,0.000000


In [65]:
python_df['PercKnowsPython'].median()

34.69387755102041
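
The top of the sorted table is dominated by countries with a single respondent, where the percentage can only be 0 or 100. A possible refinement, sketched below, is to keep only countries with a minimum number of respondents before sorting. The three rows reuse figures from the outputs above, and the threshold `MIN_RESPONDENTS` is an arbitrary value chosen for illustration.

```python
import pandas as pd

# A few rows taken from the outputs above, rebuilt as a small data frame.
python_df = pd.DataFrame(
    {"NumRespondents": [1, 7, 9061], "NumKnowsPython": [1, 6, 3105]},
    index=["Niger", "Turkmenistan", "India"],
)
python_df["PercKnowsPython"] = (
    python_df["NumKnowsPython"] / python_df["NumRespondents"] * 100
)

MIN_RESPONDENTS = 5  # arbitrary illustrative threshold, not from the tutorial

# Drop countries with too few respondents, then rank by percentage.
ranked = python_df[python_df["NumRespondents"] >= MIN_RESPONDENTS].sort_values(
    by="PercKnowsPython", ascending=False
)
print(ranked)
```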

## Recap

In this part of the tutorial we learned the following:

    1. Using aggregation methods on a single (numeric) column

        df['ConvertedComp'].median()

    2. Using aggregation methods on the whole data frame

        df.median(numeric_only=True)

    3. Using the describe method on the whole data frame

        df.describe()

            # returns summary statistics for the whole data frame (for the columns that hold numeric values), computed with several aggregation methods

    4. Counting all the answers in a given column (excluding those with the value 'NaN')

        df['ConvertedComp'].count()

    5. Counting each distinct answer within a column

        df['Hobbyist'].value_counts()

            # returns 'Yes' = 71257, 'No' = 17626 as a Series

    6. Displaying proportions instead of counts with 'value_counts()'

        df['SocialMedia'].value_counts(normalize=True)

    7. Using the groupby method

        groupby_country_df = df.groupby(['Country'])

            # returns a DataFrameGroupBy object that holds several groups, one for each country

    8. Accessing one group of a DataFrameGroupBy

        groupby_country_df.get_group('United States')

    9. Calling a method on a whole group object

        groupby_country_df['SocialMedia'].value_counts()

    10. Accessing a single group in the result of a function applied to a group object

        groupby_country_df['SocialMedia'].value_counts().loc['United States']

    11. Using an aggregation method on the whole group object

        groupby_country_df['ConvertedComp'].median()

    12. Using several aggregation methods on the whole group object

        groupby_country_df['ConvertedComp'].agg(['median', 'mean'])

    13. Applying a function to each Series in a group of Series objects

        groupby_country_df['LanguageWorkedWith'].apply(lambda x : x.str.contains('Python').sum())

    14. Concatenating two Series objects into a data frame

        country_res = df['Country'].value_counts()

        country_res_python = groupby_country_df['LanguageWorkedWith'].apply(lambda x : x.str.contains('Python').sum())

        python_df = pd.concat([country_res, country_res_python], axis='columns')

    15. Renaming the columns of the new data frame

        python_df.rename(columns={'Country' : 'NumRespondents', 'LanguageWorkedWith' : 'NumKnowsPython'}, inplace=True)

    16. Adding a new column that holds the percentage of people from each country who know Python

        python_df['PercKnowsPython'] = (python_df['NumKnowsPython'] / python_df['NumRespondents']) * 100

            