## Series

Create a [Series](https://www.w3schools.com/python/pandas/pandas_series.asp) by using `pd.Series([])`. It accepts a Python list, a NumPy array, or a Python dictionary. By default the Series will have a numeric index, but it is possible to create **labelled** indexes instead. 

In [8]:
# import pandas
import pandas as pd
df = pd.read_csv('ufo_sightings_comma.csv')
print(df.head()) 


         Date     City State     Shape  Duration (sec)  Credibility Score
0  2023-10-13  Roswell    NM      Disk              45                8.5
1  2024-01-05  Phoenix    AZ  Triangle             120                7.0
2  2023-12-21    Salem    OR    Sphere              60                6.5
3  2024-03-11    Dulce    NM     Cigar              30                9.1
4  2024-03-11    Dulce    NM     Cigar              30                9.1


In [10]:
# create labelled index
data = ["a","b","c"]
index_labels = ["A","B","C"]
s = pd.Series(data, index= index_labels)
s





A    a
B    b
C    c
dtype: object
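
As a complement to the list example above, here is a minimal sketch of the other two input types mentioned earlier. The variable names `from_dict` and `from_array` are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"A": 1, "B": 2, "C": 3})

# From a NumPy array: a default integer index (0, 1, 2, ...) is assigned
from_array = pd.Series(np.array([10, 20, 30]))

print(from_dict["B"])            # access a value by its label
print(list(from_array.index))    # inspect the default index
```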

## DataFrame

Create a new DataFrame using `pd.DataFrame({})`, which takes a Python dictionary. The keys will be the column labels, and the values (typically Python lists of equal length) will be converted into Series. 

In [250]:
# create dataframe
df = pd.DataFrame({
    "ABC": ["A","B","C"],
    "abc": ["a","b","c"],
    "123": [1,2,3]
})



# compare print function with ipynb print
print(df)
df

  ABC abc  123
0   A   a    1
1   B   b    2
2   C   c    3


Unnamed: 0,ABC,abc,123
0,A,a,1
1,B,b,2
2,C,c,3


In [17]:
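# assigning df to a new name does NOT copy it: both names refer to the same object
new_df = df
new_df[:2] = 100   # overwrites the first two rows, also in df
new_df
df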


Unnamed: 0,ABC,abc,123
0,100,100,100
1,100,100,100
2,C,c,3


In [20]:
# show how original is overwritten
# new_df = df
# new_df[:2]=100
# new_df
# df

# repeat using copy
# (re-create df first: after running the cell above, df itself already contains the 100s)
new_df = df.copy()
new_df[:2]=100
new_df
df


Unnamed: 0,ABC,abc,123
0,100,100,100
1,100,100,100
2,C,c,3
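
The difference between plain assignment and `.copy()` can be checked in isolation. The following is a small sketch with a made-up DataFrame (`original`, `alias` and `independent` are illustrative names, not part of the notebook above).

```python
import pandas as pd

original = pd.DataFrame({"ABC": ["A", "B", "C"], "123": [1, 2, 3]})

alias = original                # same object, no data is copied
independent = original.copy()   # separate object with its own data

independent.loc[0, "123"] = 100   # only changes the copy
alias.loc[1, "123"] = 200         # also changes `original`

print(alias is original)             # True
print(original["123"].tolist())      # [1, 200, 3]
print(independent["123"].tolist())   # [100, 2, 3]
```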


## Working with CSV files

We will mostly not create our own DataFrames, but rather work with imported CSV files using
- `pd.read_csv()`



In [26]:
# 1. Load CSV with comma separator (default); ufos is used in the cells below
ufos = pd.read_csv('ufo_sightings_comma.csv')
# ufos 

# 2. Load CSV with semicolon separator
ufos_sem = pd.read_csv('ufo_sightings_semicolon.csv', sep=";")
ufos_sem


Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2023-11-02,Sedona,AZ,Fireball,300,5.5
5,2024-02-28,Aurora,TX,Boomerang,180,6.8
6,2023-08-19,Kecksburg,PA,Light,15,4.2
7,2023-07-04,Area 51,NV,Unknown,200,9.8
8,2024-04-01,Fairfield,CA,Oval,90,7.3
9,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


#### Inspecting the Data
- `.head()` shows the first 5 rows
- `.tail()` shows the last 5 rows
- `.info()` shows non-null counts, dtypes, and memory usage
- `.describe()` shows summary statistics for the numerical columns
- `.columns` shows the column headers
- `.shape` shows the number of rows and columns as a tuple `(rows, columns)`

In [None]:
# Display first few rows of each
ufos.head()

# Display last few rows of each
ufos.tail(7)

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
4,2024-03-11,Dulce,NM,Cigar,30,9.1
5,2023-11-02,Sedona,AZ,Fireball,300,
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


In [33]:
# Info
ufos.info()

# Describe
ufos.describe()

# Columns
ufos.columns

# Shape
ufos.shape




(11, 6)

## Selecting Data 
#### Slicing rows OR columns

You can slice a range of rows using the **row indexes**: `DataFrame[from:to]`. These slices follow the same index pattern as in NumPy. They cannot be combined with column slicing.

You can select columns using the **column names**: `DataFrame["City"]` or `DataFrame[["City", "State"]]`. You can't select a range of columns this way; each desired column must be listed individually.

In [38]:
# row indexing [:]
ufos
ufos[3:]


Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
5,2023-11-02,Sedona,AZ,Fireball,300,
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


In [40]:
# Single column by label
ufos["City"]

# Multiple columns by label
ufos[["City", "Shape"]]


Unnamed: 0,City,Shape
0,Roswell,Disk
1,Phoenix,Triangle
2,Salem,Sphere
3,Dulce,Cigar
4,Dulce,Cigar
5,Sedona,Fireball
6,Aurora,Boomerang
7,Kecksburg,Light
8,Area 51,Unknown
9,Fairfield,Oval


## .loc & .iloc

To combine row/column slices, use `iloc` or `loc`. It is very similar to NumPy slicing:
  - `DataFrame.iloc[]` to slice by **integer position**:
    - `DataFrame.iloc[3]` will return the row at position 3 as a Series
    - `DataFrame.iloc[0:3]` will return the first 3 rows
    - `DataFrame.iloc[2,1]` will return a single value for the row/column combination
    - `DataFrame.iloc[[0,2], 0:3]` will return row 0 and row 2 and the first 3 columns
    
  - `DataFrame.loc[]` to slice by **label**:
    - `DataFrame.loc[3]` will return the row labelled 3 as a Series
    - `DataFrame.loc[0:3]` will return the rows labelled 0 through 3 (unlike `iloc`, the end label is included)
    - `DataFrame.loc[2, "Label1"]` will return a single value for the row/column combination
    - `DataFrame.loc[[0,2], ["Label1", "Label2"]]` will return row 0 and row 2 and the two chosen columns
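
One detail worth checking by hand is how the two accessors treat the end of a slice: `iloc` excludes the end position, while `loc` includes the end label. The sketch below uses a small made-up DataFrame (`demo`) rather than the UFO file.

```python
import pandas as pd

demo = pd.DataFrame(
    {"City": ["Roswell", "Phoenix", "Salem", "Dulce", "Sedona"],
     "State": ["NM", "AZ", "OR", "NM", "AZ"]}
)

by_position = demo.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = demo.loc[0:3]       # labels 0, 1, 2, 3 (end included)

print(len(by_position))  # 3
print(len(by_label))     # 4

# After filtering, labels and positions no longer line up
nm_only = demo[demo["State"] == "NM"]   # keeps the rows labelled 0 and 3
print(nm_only.iloc[1]["City"])  # second row by position: Dulce
print(nm_only.loc[3, "City"])   # row labelled 3: Dulce
```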

In [45]:
ufos

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
5,2023-11-02,Sedona,AZ,Fireball,300,
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3


In [55]:
# Example for .iloc[]
ufos.iloc[[0,2],0:2]

# ufos.iloc[[0,2],0:2]
# ufos
# ufos.iloc[2,3]

Unnamed: 0,Date,City
0,2023-10-13,Roswell
2,2023-12-21,Salem


In [56]:
# Example for .loc[]
ufos.loc[[0,2],["Date","City"]]

Unnamed: 0,Date,City
0,2023-10-13,Roswell
2,2023-12-21,Salem


## Filtering by Condition

You can use conditional statements to filter the desired data. A condition produces a boolean Series (a "boolean index"); wherever it is `True`, the row is included in the result. Specify multiple conditions by enclosing each in parentheses and combining them with `&` (and) or `|` (or). 
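
The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python. The following sketch, on a made-up DataFrame with hypothetical column names `score` and `duration`, shows the working form and what happens without parentheses.

```python
import pandas as pd

demo = pd.DataFrame({"score": [8.5, 7.0, 9.1], "duration": [45, 120, 30]})

# Correct: each comparison is wrapped in parentheses
both = demo[(demo["score"] > 8) & (demo["duration"] > 40)]
print(both.index.tolist())   # [0]

# Without parentheses, `8 & demo["duration"]` is evaluated first, and the
# resulting chained comparison cannot be reduced to a single True/False
try:
    demo[demo["score"] > 8 & demo["duration"] > 40]
    failed = False
except ValueError:
    failed = True
print(failed)   # True
```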

In [259]:
# Bool
ufos["State"] == "NM"    # shows only New Mexico, but as Boolean

# Filter by state NM
ufos[ufos["State"] == "NM"]  

# Filter credibility score > 8
ufos[ufos["Credibility Score"] > 8]

# Connect two conditions with &
ufos[(ufos["Credibility Score"] > 8) & (ufos["Duration (sec)"] > 45)]

# Connect two conditions with |
ufos[(ufos["Credibility Score"] > 8) | (ufos["Duration (sec)"] > 60)]


Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
5,2023-11-02,Sedona,AZ,Fireball,300,
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


## Sorting
- `.sort_values("Column")` will sort by the chosen column; `ascending=True` is the default setting
- `.sort_values(["Column1", "Column2"])` will sort first by Column1 and within that by Column2; the sorting order can be customized per column like this: `ascending=[True, False]`



In [260]:
# sort by state (descending)
ufos.sort_values("State", ascending=False)

# sort by state (ascending) and duration (descending)
ufos.sort_values(["State", "Duration (sec)"], ascending=[1,0])  # You can also use ascending=[True, False] 

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
5,2023-11-02,Sedona,AZ,Fireball,300,
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
9,2024-04-01,Fairfield,CA,Oval,90,7.3
0,2023-10-13,Roswell,NM,Disk,45,8.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
8,2023-07-04,Area 51,NV,Unknown,200,9.8
2,2023-12-21,Salem,OR,Sphere,60,6.5
7,2023-08-19,Kecksburg,PA,Light,15,4.2
6,2023-09-27,Aurora,TX,Boomerang,180,6.8


## Exploratory Data Analysis (EDA)


To familiarize yourself with the dataset, you might also want to use the functions below. 

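**Tip:** Many of the DataFrame-modifying methods used below (for example `.sort_values()`, `.dropna()`, `.drop_duplicates()`, `.rename()`, `.drop()` and `.reset_index()`) have a default argument of `inplace=False`. This means the method returns a _new_ DataFrame/Series which you must save to a variable. If you wish to modify the original DataFrame instead, you can set the keyword argument `inplace=True`. Inspection methods such as `.duplicated()` or `.nunique()` do not take `inplace`.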
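
A quick way to see the effect of `inplace` is to compare the original before and after each call. This is a minimal sketch using `.sort_values()` on a made-up DataFrame (`demo`).

```python
import pandas as pd

demo = pd.DataFrame({"State": ["NM", "AZ", "OR"], "Score": [8.5, 7.0, 6.5]})

# Default (inplace=False): a new DataFrame is returned, `demo` is unchanged
sorted_demo = demo.sort_values("Score")
order_before = demo["State"].tolist()
print(order_before)                    # ['NM', 'AZ', 'OR']
print(sorted_demo["State"].tolist())   # ['OR', 'AZ', 'NM']

# inplace=True: `demo` itself is reordered, nothing needs to be reassigned
demo.sort_values("Score", inplace=True)
print(demo["State"].tolist())          # ['OR', 'AZ', 'NM']
```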



#### Check for Duplicates
- `.duplicated()` returns a boolean Series marking each row that repeats an earlier row. 
- `.duplicated(keep=False)` marks all occurrences of duplicated rows, including the first. 
- `.duplicated(subset=["Column Name"])` lets you look for duplicates in specific columns only.

To get a tabular, non-boolean output, use the result as a filter like this:
`df[df.duplicated()]`.
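
To make the three variants concrete, here is a small sketch on a made-up three-row DataFrame (`demo`), where the first and last rows are identical.

```python
import pandas as pd

demo = pd.DataFrame(
    {"City": ["Dulce", "Salem", "Dulce"], "State": ["NM", "OR", "NM"]}
)

default_mask = demo.duplicated()                  # later repeats only
all_mask = demo.duplicated(keep=False)            # every occurrence
state_mask = demo.duplicated(subset=["State"])    # compare "State" only

print(default_mask.tolist())   # [False, False, True]
print(all_mask.tolist())       # [True, False, True]
print(state_mask.tolist())     # [False, False, True]
```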

In [59]:
# check for duplicate rows
ufos[ufos.duplicated()]

# check for duplicate rows (keep=False)
ufos[ufos.duplicated(keep=False)]

# check for duplicates in a single column (subset=["State"], keep=False)
ufos[ufos.duplicated(subset=["State"], keep=False)]

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
5,2023-11-02,Sedona,AZ,Fireball,300,


#### Check for Unique Values
- `.nunique()` will give you the number of unique values in each column.


In [262]:
# Show the number of unique values per column
ufos.nunique()


Date                 10
City                 10
State                 8
Shape                10
Duration (sec)       10
Credibility Score     9
dtype: int64

#### Check for Null Values
- `.isnull()` returns a boolean mask that is `True` wherever a value is null (NaN)
- `.isnull().sum()` gives you a summary of the number of null values per column


In [60]:
# Show the total number of null (missing) values per column
ufos.isnull().sum()


Date                 0
City                 0
State                0
Shape                0
Duration (sec)       0
Credibility Score    1
dtype: int64

#### Value Count
- `df.value_counts()` shows how often each full row appears in the df
- `df.value_counts()[df.value_counts() > 1]` shows all rows that appear more than once
- `df["Column Name"].value_counts()` counts the value frequency of the chosen column


In [264]:
# Show how many times each full row appears in the DataFrame
ufos.value_counts()

# Show only full rows that occur more than once
ufos.value_counts()[ufos.value_counts() >1]


# Show value count for State
ufos["State"].value_counts()


State
NM    3
AZ    2
OR    1
TX    1
PA    1
NV    1
CA    1
WV    1
Name: count, dtype: int64

## Modify Your DataFrame

Once you are familiar with your data, you'll need to start cleaning it. Some useful Pandas functions to achieve this include:

#### Rename Columns
- `DataFrame.rename(columns={ "old": "new" })` renames the given column labels
- `DataFrame.rename({ "old": "new" }, axis=1)` does the same using the `axis` argument



In [265]:
# rename column "Shape" to "Form" (returns a new DataFrame; ufos itself is unchanged)
ufos.rename({"Shape":"Form"}, axis=1)


Unnamed: 0,Date,City,State,Form,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
5,2023-11-02,Sedona,AZ,Fireball,300,
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3


#### Convert to Datetime
-   `pd.to_datetime(df["Column"])` converts the values of the selected column to `datetime`


In [266]:
# convert ufos["Date"] to datetime

ufos["Date"] = pd.to_datetime(ufos["Date"])
ufos.dtypes


Date                 datetime64[ns]
City                         object
State                        object
Shape                        object
Duration (sec)                int64
Credibility Score           float64
dtype: object

#### Replace or Drop NaN / None Values
 - `.fillna()` replaces null values with a chosen alternative value (e.g. the mean or median)
 - `.dropna()` drops rows with null values
 - Create a new variable to store the altered table.
 - Alternatively, you can alter the original table by passing `inplace=True`.
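
The two strategies can be compared side by side. The sketch below uses a made-up DataFrame (`demo`) with a single missing score; note that `.median()` skips NaN values by default.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"City": ["Roswell", "Sedona", "Salem"],
                     "Score": [8.5, np.nan, 6.5]})

# Strategy 1: drop the row that contains the missing value
dropped = demo.dropna()

# Strategy 2: keep the row and fill the gap with the column median
median_score = demo["Score"].median()            # median of 8.5 and 6.5
filled = demo.fillna({"Score": median_score})

print(len(dropped))               # 2
print(filled["Score"].tolist())   # [8.5, 7.5, 6.5]
```

Neither call changes `demo` itself, since both return new DataFrames.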

In [267]:
# Median of Credibility Score
ufos["Credibility Score"].median()

np.float64(7.15)

In [268]:
ufos

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
5,2023-11-02,Sedona,AZ,Fireball,300,
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3


In [269]:
# dropna and changing original df
# ufos = ufos.dropna()              # option 1: reassign the result (no inplace=True)
ufos.dropna(inplace=True)       # option 2: no reassignment, use inplace=True instead

# Instead of dropping rows with NaN values, you could also replace them with their mean or median
# ufos["Credibility Score"] = ufos["Credibility Score"].fillna(ufos["Credibility Score"].median())
ufos

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
4,2024-03-11,Dulce,NM,Cigar,30,9.1
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


#### Drop Duplicate Rows
  - `.drop_duplicates()` drops duplicate rows
  - `.reset_index(drop=True)` resets the index after dropping rows.

Create a variable to hold the new table. Use the original DataFrame name if you want to save the changes directly to it, or pass `inplace=True`. Alternatively, use a different variable name to protect the original.
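
Dropping rows leaves gaps in the index, which is why the two methods are often used together. A minimal sketch on a made-up DataFrame (`demo`):

```python
import pandas as pd

demo = pd.DataFrame(
    {"City": ["Roswell", "Dulce", "Dulce", "Salem"],
     "State": ["NM", "NM", "NM", "OR"]}
)

deduped = demo.drop_duplicates()          # removes the second "Dulce" row
print(deduped.index.tolist())             # [0, 1, 3]

clean = deduped.reset_index(drop=True)    # renumber without keeping the old index
print(clean.index.tolist())               # [0, 1, 2]
```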


In [270]:
# Drop duplicates inplace=True (alternative would be to create ufos variable again)

ufos.drop_duplicates(inplace=True)
ufos

# reset index and drop extra index column
# ufos.reset_index(drop=True, inplace=True)
# ufos

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


#### Aggregation

**Aggregation** methods known from NumPy can also be applied to any numerical column in Pandas, including filtered selections. Some methods include:
  - `.sum()`
  - `.min()` and `.max()`
  - `.count()`
  - `.mean()`
  - `.std()`

In [271]:
# Show the standard deviation of Credibility Score
ufos["Credibility Score"].std()

np.float64(1.702286044640494)

#### Groupby 
- `.groupby("Column").mean(numeric_only=True)` will group the df by Column and compute the mean of every numerical column for each group.
- `ufos.groupby("Column1")["Column2"].max()` will group by Column1 and output the max value of the chosen Column2 for each group. 

In [272]:
# group df by State / get mean for every numerical column
ufos.groupby("State").mean(numeric_only=True)

# group df by State / get max Credibility Score
ufos.groupby("State")["Credibility Score"].max()


State
AZ    7.0
CA    7.3
NM    9.1
NV    9.8
OR    6.5
PA    4.2
TX    6.8
WV    6.0
Name: Credibility Score, dtype: float64

#### Creating and Modifying Columns

In [273]:
ufos

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


In [274]:
# Create new column "Duration (min)"
ufos["Duration (min)"] = ufos["Duration (sec)"] / 60
ufos




Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score,Duration (min)
0,2023-10-13,Roswell,NM,Disk,45,8.5,0.75
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0,2.0
2,2023-12-21,Salem,OR,Sphere,60,6.5,1.0
3,2024-03-11,Dulce,NM,Cigar,30,9.1,0.5
6,2023-09-27,Aurora,TX,Boomerang,180,6.8,3.0
7,2023-08-19,Kecksburg,PA,Light,15,4.2,0.25
8,2023-07-04,Area 51,NV,Unknown,200,9.8,3.333333
9,2024-04-01,Fairfield,CA,Oval,90,7.3,1.5
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0,1.25


#### Dropping Columns
-   `.drop()` drops selected rows (`axis=0`) or columns (`axis=1`) from the DataFrame

In [275]:
# drop selected column Duration (min)
ufos = ufos.drop(columns=["Duration (min)"])
ufos

# alternative: ufos.drop(columns=["Duration (min)"], inplace=True)

Unnamed: 0,Date,City,State,Shape,Duration (sec),Credibility Score
0,2023-10-13,Roswell,NM,Disk,45,8.5
1,2024-01-05,Phoenix,AZ,Triangle,120,7.0
2,2023-12-21,Salem,OR,Sphere,60,6.5
3,2024-03-11,Dulce,NM,Cigar,30,9.1
6,2023-09-27,Aurora,TX,Boomerang,180,6.8
7,2023-08-19,Kecksburg,PA,Light,15,4.2
8,2023-07-04,Area 51,NV,Unknown,200,9.8
9,2024-04-01,Fairfield,CA,Oval,90,7.3
10,2023-09-09,Point Pleasant,WV,Rectangle,75,6.0


#### Applying Functions
There are many more possibilities. These functions are all very customizable, most accepting multiple optional arguments. Refer to the Pandas documentation for [general functions](https://pandas.pydata.org/docs/reference/general_functions.html), [Series](https://pandas.pydata.org/docs/reference/series.html), and [DataFrames](https://pandas.pydata.org/docs/reference/frame.html) for more information whenever you are using one - it might do more than you think! 

W3Schools also offers a [simplified documentation](https://www.w3schools.com/python/pandas/pandas_intro.asp) that is great for beginners. If built-in functions cannot fulfill our requirements, we can also use our own function. 
- `.apply()` can be used with a custom function to create a new column based on logic

In [276]:
# Pick the desired order for columns (there are also many other ways to do this)
ufos = ufos[["Date", "State", "City", "Shape", "Duration (sec)", "Credibility Score"]]
ufos

Unnamed: 0,Date,State,City,Shape,Duration (sec),Credibility Score
0,2023-10-13,NM,Roswell,Disk,45,8.5
1,2024-01-05,AZ,Phoenix,Triangle,120,7.0
2,2023-12-21,OR,Salem,Sphere,60,6.5
3,2024-03-11,NM,Dulce,Cigar,30,9.1
6,2023-09-27,TX,Aurora,Boomerang,180,6.8
7,2023-08-19,PA,Kecksburg,Light,15,4.2
8,2023-07-04,NV,Area 51,Unknown,200,9.8
9,2024-04-01,CA,Fairfield,Oval,90,7.3
10,2023-09-09,WV,Point Pleasant,Rectangle,75,6.0


In [278]:
# Add a "Category" based on credibility
def credibility_category(score):
    if score > 8:
        return "high"
    elif score > 6:
        return "medium"
    else:
        return "low"
        
    
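# Add the "Credibility Category" column to the UFO sightings.
# Note: the SettingWithCopyWarning shown below most likely appears because ufos
# was created by selecting columns from another DataFrame in the previous cell;
# adding .copy() to that selection should avoid it.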
ufos["Credibility Category"] = ufos["Credibility Score"].apply(credibility_category)
ufos

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ufos["Credibility Category"] = ufos["Credibility Score"].apply(credibility_category).copy()


Unnamed: 0,Date,State,City,Shape,Duration (sec),Credibility Score,Credibility Category
0,2023-10-13,NM,Roswell,Disk,45,8.5,high
1,2024-01-05,AZ,Phoenix,Triangle,120,7.0,medium
2,2023-12-21,OR,Salem,Sphere,60,6.5,medium
3,2024-03-11,NM,Dulce,Cigar,30,9.1,high
6,2023-09-27,TX,Aurora,Boomerang,180,6.8,medium
7,2023-08-19,PA,Kecksburg,Light,15,4.2,low
8,2023-07-04,NV,Area 51,Unknown,200,9.8,high
9,2024-04-01,CA,Fairfield,Oval,90,7.3,medium
10,2023-09-09,WV,Point Pleasant,Rectangle,75,6.0,low


In [None]:
# Alternative using a lambda function (same thresholds and labels as the function above)
ufos["Credibility Category"] = ufos["Credibility Score"].apply(
    lambda x: "high" if x > 8 else "medium" if x > 6 else "low"
)

#### Exporting Data

In [None]:
# Create filtered df with high credibility score

filtered = ufos[ufos["Credibility Score"] > 8]
filtered

Unnamed: 0,Date,State,City,Shape,Duration (sec),Credibility Score,Credibility Category
0,2023-10-13,NM,Roswell,Disk,45,8.5,high
3,2024-03-11,NM,Dulce,Cigar,30,9.1,high
6,2023-07-04,NV,Area 51,Unknown,200,9.8,high


In [None]:
# index=False keeps the row index from being written as an extra column in the CSV
filtered.to_csv('high_credibility.csv', index=False)