# 1- Installation and loading data in pandas

Pandas is a powerful library for data analysis in Python. It's built on top of NumPy, which makes it fast and a great option for dealing with large amounts of data.


## Installing pandas

The first step in using pandas is installing it. Once you've created a Python environment, you can use `pip` (the preferred installer program) to install Python packages into that environment. In our case, `pip install pandas` installs pandas into the current environment.

This can be done in several ways. The most common is to run the command in a terminal, but if you're using a Jupyter notebook (such as this one), you can prefix a command with `%` to run a so-called "**_line magic_**". The `%pip` magic runs `pip` for the environment of the running kernel, directly from a line in a Jupyter cell. We'll use it to install the pandas package, as shown below.


In [1]:
%pip install pandas

Collecting pandas
  Using cached pandas-2.2.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (89 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.1.3-cp312-cp312-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.2.3-cp312-cp312-macosx_11_0_arm64.whl (11.4 MB)
Downloading numpy-2.1.3-cp312-cp312-macosx_14_0_arm64.whl (5.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.1/5.1 MB 623.7 kB/s eta 0:00:00
Using cached pytz-2024.2-py2.py3-none-any.whl (508 kB)
Using cached tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-2.1.3 pandas-2.2.3 pytz-2024.2 tzdata-2024.2
Note: you may need to restart the kernel to use upda
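
To confirm that the install worked, one option is to ask Python which version it can see. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of pandas or pip.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package` as a string, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


pandas_version = installed_version("pandas")
print("pandas:", pandas_version or "not installed in this environment")
```

If this prints "not installed", the kernel is probably running in a different environment from the one `pip` installed into, or it needs a restart.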

## Importing pandas

To import pandas, as with any other package in Python, you use the `import` keyword. For pandas and some other packages, it's conventional to create an alias. This is done at import time by adding the keyword `as`, followed by the alias, after the package name.

In the case of `pandas` the alias is `pd`, for `numpy` it's `np`, for `matplotlib.pyplot` it's `plt`, and so on.


In [1]:
import pandas as pd
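
As a small illustration of the aliases in use, and of pandas sitting on top of NumPy, the sketch below builds a tiny `Series` and pulls out its underlying NumPy array. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# A tiny Series with made-up values
s = pd.Series([1, 2, 3])

# to_numpy() exposes the data as a NumPy array
arr = s.to_numpy()

print(type(arr))
print(np.mean(arr))
```

Because NumPy is installed as a pandas dependency, `import numpy as np` should work in any environment where pandas does.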

## Reading data with pandas

Pandas can read a plethora of file types. By calling one of the functions whose name starts with `pd.read_` and passing the file path as an argument, you can read most types of data. Our use case is one of the most common types, a `.csv` file, so we'll use `pd.read_csv()`.

These functions return an object called a `DataFrame`, the most important object in the entire pandas library.
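
To try `pd.read_csv()` without any file on disk, a minimal sketch can read CSV text from memory through `io.StringIO`. The three rows below are invented for illustration and only borrow a few column names from the survey; they are not taken from the real dataset.

```python
import io

import pandas as pd

# Invented sample data: NOT rows from the actual survey file
csv_text = (
    "Respondent,Hobbyist,Country\n"
    "1,Yes,Spain\n"
    "2,No,Canada\n"
    "3,Yes,Pakistan\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))
print(type(sample_df))
print(sample_df)

# Optionally, use one of the columns as the row index instead of 0, 1, 2, ...
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="Respondent")
print(indexed_df)
```

The same call works with a real path, such as the survey file used in this notebook.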


In [2]:
df = pd.read_csv('stack-overflow-developer-survey-2019/survey_results_public.csv')

In [None]:
schema_df = pd.read_csv(
    'stack-overflow-developer-survey-2019/survey_results_schema.csv'
)

## Showing info on a DF

We'll dig deeper into DataFrames in the next files, but we can already learn a few things about our data.

This can be done with `head`, which shows the first rows of a DataFrame; `tail`, which shows the last rows; `shape`, which gives the number of rows and columns (it's an attribute rather than a method, but it belongs in the same toolbox); and `info`, which lists the data type and non-null count of each column.


In [None]:
# We can specify how many rows are shown
df.tail(10)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
88873,88062,,No,Never,"OSS is, on average, of LOWER quality than prop...",,,,,,...,,,,,,,,,,
88874,88076,,No,Never,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,,,,,...,,,,,,,,,,
88875,88182,,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed part-time,Pakistan,,"Secondary school (e.g. American high school, G...",,...,Not applicable - I did not use Stack Overflow ...,Courses on technologies you're interested in,,Man,No,Straight / Heterosexual,,Yes,Too short,Neither easy nor difficult
88876,88212,,No,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Spain,No,"Secondary school (e.g. American high school, G...",,...,,Tech articles written by other developers;Indu...,40.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
88877,88282,,Yes,Once a month or more often,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,,,Man,No,Straight / Heterosexual,,No,Too short,Neither easy nor difficult
88878,88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,No,Primary/elementary school,,...,,Tech articles written by other developers;Tech...,,Man,No,,,No,Appropriate in length,Easy
88879,88601,,No,Never,The quality of OSS and closed source software ...,,,,,,...,,,,,,,,,,
88880,88802,,No,Never,,Employed full-time,,,,,...,,,,,,,,,,
88881,88816,,No,Never,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",,,,,...,,,,,,,,,,
88882,88863,,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, and not looking for work",Spain,"Yes, full-time","Professional degree (JD, MD, etc.)","Computer science, computer engineering, or sof...",...,Somewhat less welcome now than last year,Tech articles written by other developers;Indu...,18.0,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;White or of European...,No,Appropriate in length,Easy


In [4]:
df.shape

(88883, 85)

In [8]:
schema_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Column        85 non-null     object
 1   QuestionText  85 non-null     object
dtypes: object(2)
memory usage: 1.5+ KB
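
The same inspection tools can be tried on a small hand-made DataFrame, where the results are easy to check by eye. This is a sketch with invented values; `toy_df` is a hypothetical name, not part of the survey data.

```python
import pandas as pd

# Invented data: seven rows, two columns
toy_df = pd.DataFrame(
    {
        "Respondent": range(1, 8),
        "Hobbyist": ["Yes", "No", "Yes", "Yes", "No", "Yes", "No"],
    }
)

first_rows = toy_df.head()    # no argument: the first 5 rows
last_two = toy_df.tail(2)     # the last 2 rows
n_rows, n_cols = toy_df.shape  # attribute, so no parentheses

print(first_rows)
print(last_two)
print(n_rows, n_cols)

# info() prints its summary directly instead of returning it
toy_df.info()
```

Note that `shape` is accessed without parentheses, while `head`, `tail`, and `info` are called like any other method.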
