# Pandas Basics

## Timeline
- 2008: Development of pandas started at [AQR Capital Management](https://www.aqr.com/)
- 2009: pandas becomes open source
- 2012: First edition of Python for Data Analysis is published
- 2015: pandas becomes a NumFOCUS sponsored project
- 2018: First in-person core developer sprint

More about pandas: <https://pandas.pydata.org/about/index.html>



### Step 1. Import the necessary libraries

If you don't already have one, create a new virtual environment:

```shell
pyenv virtualenv 3.13.5 cs181
```

Then activate the virtual environment:
```shell
pyenv activate cs181
```

Now start managing this virtual environment with Poetry:
```shell
poetry init
```

Now add **pandas** to dependencies.

Run this in the terminal:
```shell
poetry add pandas
poetry add matplotlib
poetry add seaborn
```

This installs each package into the activated virtual environment, much like **pip install pandas**, and records it as a dependency in `pyproject.toml`.


Then choose the virtual environment that contains the dependencies:

![](./img/6.1-pandas/virtual-environment.png)

In [None]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users and use the 'user_id' as index

In [None]:
users = pd.read_csv(
    "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user",
    sep="|",
    index_col="user_id",
)
users
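To see what `sep` and `index_col` do without downloading anything, here is a minimal sketch that parses a few pipe-delimited rows from memory. The three rows and the name `sample_users` are illustrative assumptions, not part of the exercise.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same pipe-delimited layout as u.user.
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# sep="|" splits fields on the pipe character;
# index_col="user_id" turns that column into the row index.
sample_users = pd.read_csv(sample, sep="|", index_col="user_id")

print(sample_users.index.name)     # the index now carries the name user_id
print(list(sample_users.columns))  # user_id is no longer a regular column
```

Without `sep="|"`, pandas would split on commas and read each line as a single column.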

### Step 4. See the first 25 entries
Solve it in 2 different ways

In [None]:
users.head(25)

In [None]:
users.iloc[0:25]

### Step 5. See the last 10 entries

In [None]:
users.tail(10)

In [None]:
users.iloc[-10:]
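The `head`/`tail` and `iloc` forms in Steps 4 and 5 select rows by position. Because `user_id` starts at 1, label-based `loc` behaves slightly differently: its slice end is included. A small sketch on a hypothetical five-row frame (`demo` is an assumed name):

```python
import pandas as pd

# Hypothetical frame whose index starts at 1, like user_id in the dataset.
demo = pd.DataFrame(
    {"age": [24, 53, 23, 24, 33]},
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

first_by_head = demo.head(3)    # first three rows
first_by_iloc = demo.iloc[0:3]  # positions 0, 1, 2 (end excluded)
first_by_loc = demo.loc[1:3]    # labels 1 through 3 (end included)

print(first_by_head.equals(first_by_iloc))
print(first_by_head.equals(first_by_loc))

# The same positional idea works from the end of the frame.
print(demo.tail(2).equals(demo.iloc[-2:]))
```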

### Step 6. What is the number of observations in the dataset?
Solve it in 2 different ways

In [None]:
users.shape[0]

In [None]:
len(users.index)

### Step 7. What is the number of columns in the dataset?
Solve it in 2 different ways

In [None]:
users.shape[1]

In [None]:
len(users.columns)
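Steps 6 and 7 both read from the same `shape` tuple, so the two numbers can be unpacked at once. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
demo = pd.DataFrame({"age": [24, 53, 23], "gender": ["M", "F", "M"]})

# shape is a (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# len() of a DataFrame counts rows, not cells.
print(len(demo))
```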

### Step 8. Print the name of all the columns.

In [None]:
users.columns

### Step 9. How is the dataset indexed? Get the index values

In [None]:
users.index

### Step 10. What is the data type of each column?
Solve it in 2 different ways

In [None]:
for col in users.columns:
    print(col, users[col].dtype)
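A second way to answer Step 10 is the `dtypes` attribute, which returns every column's type in a single Series. The sketch below uses a hypothetical two-column frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
demo = pd.DataFrame({"age": [24, 53], "occupation": ["technician", "other"]})

# dtypes is a Series that maps each column name to its dtype.
col_types = demo.dtypes
print(col_types)
```

On the exercise data the equivalent call would be `users.dtypes`.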

### Step 11. Print only the occupation column
Solve it in 2 different ways

In [None]:
users.occupation

In [None]:
users["occupation"]

### Step 12. How many different occupations are in this dataset? How many users per occupation are in the dataset?

In [None]:
len(users["occupation"].unique())

In [None]:
users.groupby(by="occupation").count()

### Step 13. What is the most frequent occupation?

In [None]:
freq = users.groupby(by="occupation").count()

freq[freq["age"] == freq["age"].max()].index[0]

### Step 14. Summarize the DataFrame.

In [None]:
users.info()

In [None]:
users.describe()
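By default `describe()` summarizes only the numeric columns. Passing `include="all"` adds rows such as `unique`, `top`, and `freq` for text columns. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
demo = pd.DataFrame(
    {"age": [24, 53, 23], "occupation": ["student", "other", "student"]}
)

# include="all" summarizes every column; statistics that do not apply are NaN.
summary = demo.describe(include="all")
print(summary)

print(summary.loc["top", "occupation"])  # most frequent occupation
print(summary.loc["mean", "age"])        # numeric statistics are still present
```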

### Step 15. What is the mean age of users?

In [None]:
users.age.mean()

### Step 16. Which ages occur least often?

In [None]:
age = users.groupby(by="age").count()
age[age["gender"] == 1].index.to_list()

### How many people are there in each occupation?

In [None]:
df = users.copy()

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Count number of people per occupation
occupation_counts = df["occupation"].value_counts()
occupation_counts

In [None]:
# Plot
plt.figure(figsize=(6, 4))
plt.bar(occupation_counts.index, occupation_counts.values, color="steelblue")
plt.title("Number of People per Occupation")
plt.xlabel("Occupation")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

What issues do we encounter with the following graph?

In [None]:
occupation_counts2 = occupation_counts.sort_values(ascending=True)
plt.figure(figsize=(6, 4))
plt.barh(occupation_counts2.index, occupation_counts2.values, color="steelblue")
plt.title("Number of People per Occupation")
plt.xlabel("Occupation")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
import seaborn as sns

sns.countplot(
    data=df,
    y="occupation",
    color="steelblue",
    order=df["occupation"].value_counts().index,
)
