In [None]:
# Import the pandas library. I like to use the full names of my libraries in my code so it is easier to read.

import pandas
from pandas.api import types as pdtypes

openfield_file = pandas.read_csv('file_path')

## This command allows me to see all of the columns in the dataframe when I call the .head() method

pandas.set_option('display.max_columns', None)

openfield_file.head()

In [None]:
#Let's drop any rows that do not have the value 'Contact' in the column 'conversation_type'

openfield_file = openfield_file.drop((openfield_file[openfield_file['conversation_type'] != 'Contact'].index))

#Let's see if it worked
openfield_file.head()


Let's break down how this worked! The .drop() method in pandas expects row labels (index values) or column names as its argument, not a boolean mask.

--> *df['column_name'] != 'specific_value'* creates a boolean mask where True marks the rows that do not contain the value (the rows we want to remove).
--> *df[df['column_name'] != 'specific_value']* selects those rows from the DataFrame.
--> *.index* retrieves the index labels of the selected rows.
--> *.drop()* removes the rows with those index labels and returns a new DataFrame, which we reassign to the variable.
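
The same steps can be traced on a tiny made-up DataFrame. This is only a sketch: the data and the names example, mask, labels_to_drop, dropped, and filtered are assumptions for illustration, not part of the real file.

```python
import pandas

# Hypothetical sample data standing in for the real file
example = pandas.DataFrame({
    "conversation_type": ["Contact", "Voicemail", "Contact", "Text"],
    "notes": ["a", "b", "c", "d"],
})

# Step 1: boolean mask, True for the rows we want to remove
mask = example["conversation_type"] != "Contact"

# Steps 2 and 3: select those rows and take their index labels
labels_to_drop = example[mask].index

# Step 4: drop the rows with those labels
dropped = example.drop(labels_to_drop)

# Alternative: keep the matching rows directly with the opposite mask
filtered = example[example["conversation_type"] == "Contact"]

print(dropped)
print(filtered)
```

Both approaches keep the same rows here. Filtering with the mask directly is shorter, and it also behaves more predictably when the index contains duplicate labels, because .drop() removes every row that shares a dropped label.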

In [None]:
#Drop the columns you no longer need

openfield_file = openfield_file.drop(columns=['column1',
        'column2',
        'column3'])

#Let's see if it worked
openfield_file.head()

In [None]:
# Let's rename the columns to make our dataframe easier to read. We will use the .rename() method
# df = df.rename(columns={'old_name1': 'new_name1', 'old_name2': 'new_name2'})

openfield_file = openfield_file.rename(columns={
    'old_column_name1': 'new_column_name1',
    'old_column_name2': 'new_column_name2',
})

#Let's see if it worked
openfield_file.head()

The .rename() method works as follows:

--> It takes a columns parameter, which is a dictionary {key:value}.
--> In this dictionary, the keys are the current column names you want to change, and the values are the new names you want to assign.
--> The method returns a new DataFrame with the updated column names, which we reassign to the dataframe.
--> Any column names not specified in the dictionary remain unchanged.

*You can do this a different way*

Directly assigning new names to the columns attribute:
df.columns = ['new_name1', 'new_name2', 'new_name3']

This approach works as follows:

--> df.columns is an attribute of the DataFrame that holds the column labels.
--> By assigning a new list to df.columns, you're directly replacing all the column names.
--> The new list must have the same length as the number of columns in your DataFrame.
--> This method changes all column names at once, in the order they appear in the DataFrame.

Key differences:

--> **The rename() method is more flexible because it lets you change specific columns without affecting the others.**
--> Direct assignment to df.columns is simpler but requires you to supply a new name for every column.
--> rename() returns a new DataFrame (the assignment df = df.rename(...) only makes it look like an in-place change), while df.columns = ... modifies the existing DataFrame.
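
Here is a small sketch of both approaches side by side. The DataFrame and the column names (q1, q2, q3, age, height, weight) are invented for the example.

```python
import pandas

# Hypothetical DataFrame with three columns
example = pandas.DataFrame({"q1": [1, 2], "q2": [3, 4], "q3": [5, 6]})

# rename(): change only the columns named in the dictionary
renamed = example.rename(columns={"q1": "age"})
print(list(renamed.columns))   # q2 and q3 keep their names
print(list(example.columns))   # the original DataFrame is unchanged

# Direct assignment: every column needs a new name, in order
reassigned = example.copy()
reassigned.columns = ["age", "height", "weight"]
print(list(reassigned.columns))

# A list of the wrong length is rejected with a ValueError
length_error = None
try:
    reassigned.columns = ["only_one_name"]
except ValueError as error:
    length_error = str(error)
print(length_error)
```

Because direct assignment depends on column order, rename() is usually the safer choice when the file layout might change between exports.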


In [None]:
# We won't use this dictionary in the cleaning code itself.
# We create it so we can export it as a dataframe,
# so people who use our data know which column corresponds to which question.

# Let's reuse the dictionary from the .rename() method above and save it as a new variable

openfield_question_column_dictionary = {
    'question1':'shortened_column_name1',
    'question2':'shortened_column_name2',
}
# Let's see if the dictionary saved properly
print(openfield_question_column_dictionary)


In [None]:
# Now turn the dictionary from before into a dataframe we can save to csv.

openfield_question_column_file = pandas.DataFrame.from_dict(
    openfield_question_column_dictionary,
    orient='index')

openfield_question_column_file.head()
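
To see what orient='index' produces, here is a minimal sketch with a made-up two-entry dictionary. The names lookup, unnamed, named, short_name, and question are assumptions for illustration. Passing columns= is optional; it simply gives the single column a readable header instead of the default 0.

```python
import pandas

# Hypothetical question-to-column lookup
lookup = {
    "question1": "shortened_column_name1",
    "question2": "shortened_column_name2",
}

# Each dictionary key becomes a row label; the values land in one column named 0
unnamed = pandas.DataFrame.from_dict(lookup, orient="index")
print(unnamed)

# Optional: name the column and the index so the exported csv has clear headers
named = pandas.DataFrame.from_dict(lookup, orient="index", columns=["short_name"])
named.index.name = "question"
print(named)
```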


In [None]:
#Create a file with the column names that you can use later as a reference if needed

openfield_question_column_file.to_csv('new_file_path')

In [None]:
## Assign the variable "column_list" a list of the column names you want to work on. 
## This variable will be used later!

column_list = list(openfield_file.columns)

#Make sure the columns are listed by printing the list

print(column_list)


In [None]:
#Make sure that each of the columns in the list is an object/string type.

for column in column_list:
    if column in openfield_file.columns:
        openfield_file[column] = openfield_file[column].astype(str)

#Use the .info() method to make sure it worked!
openfield_file.info()

In [None]:
# Create a function called "clean_column" that takes one argument, "column" (a pandas Series).
    # It first tests whether the data type is string
        # and, if so, uses the str.replace() method to strip square brackets and quote characters.
    # Otherwise it prints a message and returns the column unchanged.
    ## It only changes columns whose data type is "string", a.k.a. "object" in pandas

def clean_column(column):
    if pandas.api.types.is_string_dtype(column):
        return column.str.replace(r"[\[\]'\"]", "", regex=True)
    else:
        # Use column.name (the Series label); printing the Series itself would dump all of its values
        print(f"Column {column.name} is not string type, skipping")
    return column

*pandas.api.types* is a submodule of pandas that provides a collection of data type-related functions and utilities. Here's a breakdown:

**Namespace:** It's a way to organize related functionality within the pandas library.

**Purpose:** This submodule contains functions for working with, checking, and manipulating data types in pandas.

**Common uses**:
--> Checking data types of Series or DataFrame columns
--> Determining if a data type belongs to a certain category (e.g., numeric, string, etc.)
--> Inferring the data type of a set of values (the conversion itself is done with methods such as .astype() or pandas.to_numeric())

**Some common functions in this module:**

--> *is_numeric_dtype()*: Checks if a dtype is numeric
--> *is_datetime64_any_dtype()*: Checks if a dtype is any kind of datetime64 dtype
--> *is_categorical_dtype()*: Checks if a dtype is of the Categorical type (deprecated in recent pandas releases, where isinstance(dtype, pandas.CategoricalDtype) is the suggested replacement)
--> *is_string_dtype()*: Checks if a dtype is a string type
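
As a quick illustration, the checks can be run on a small made-up DataFrame. The column names (name, count, when) and the checks dictionary are assumptions for this sketch. Note that the result of is_string_dtype() on object columns may differ between pandas versions, so treat this as a sketch rather than a guarantee.

```python
import pandas

# Hypothetical DataFrame with one text, one integer, and one datetime column
example = pandas.DataFrame({
    "name": ["a", "b"],
    "count": [1, 2],
    "when": pandas.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Run three of the dtype checks on every column
checks = {}
for column in example.columns:
    checks[column] = {
        "string": pandas.api.types.is_string_dtype(example[column]),
        "numeric": pandas.api.types.is_numeric_dtype(example[column]),
        "datetime": pandas.api.types.is_datetime64_any_dtype(example[column]),
    }

for column, result in checks.items():
    print(column, result)
```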

In [None]:
# Use a for loop that applies the "clean_column()" function to each column in "column_list"
    # The first line tests whether the column is in the dataframe
        # The second line runs the "clean_column" function on that column of the dataframe
    # If the column is not found, it prints a warning

for column in column_list:
    if column in openfield_file.columns:
        openfield_file[column] = clean_column(openfield_file[column])
    else:
        print(f"Warning: Column '{column}' not found in the DataFrame")

#Let's see if it worked
openfield_file.head()
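
To make it clearer what the regular expression in clean_column() removes, here is a small sketch on invented values. The Series contents and the names raw and cleaned are assumptions; only the pattern is taken from the function above.

```python
import pandas

# Hypothetical values that look like stringified lists
raw = pandas.Series(["['yes', 'no']", '["maybe"]', "plain text"])

# The character class matches [ and ] plus single and double quotes;
# every match is replaced with an empty string
cleaned = raw.str.replace(r"[\[\]'\"]", "", regex=True)

print(cleaned.tolist())
```

Commas and spaces are left alone, so a value that started as a list of answers ends up as comma-separated text.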

In [None]:
#Let's convert any columns that hold numeric data back to integer type, since we are not working with floats.


columns_to_convert_to_int = ['column1',
    'column2',
    'column3'
    ]

#Iterate over the items in the list
for column in columns_to_convert_to_int:
    openfield_file[column] = pandas.to_numeric(openfield_file[column], errors='coerce').astype('Int64')
    
#Let's make sure it worked
openfield_file.info()
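
The conversion above can be sketched on a few made-up strings to show what errors='coerce' and the nullable 'Int64' type do together. The values and the names raw and converted are assumptions for illustration.

```python
import pandas

# Hypothetical string column: some numbers, a stringified missing value, and stray text
raw = pandas.Series(["1", "2", "nan", "abc", "7"])

# errors="coerce" turns anything that is not a number into a missing value;
# "Int64" (capital I) is the nullable integer type, so missing values are allowed
converted = pandas.to_numeric(raw, errors="coerce").astype("Int64")

print(converted)
print(converted.dtype)
```

The plain 'int64' type cannot hold missing values, which is why the nullable 'Int64' is used whenever some entries may fail to convert.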

In [None]:

#Let's make sure the previous code worked
openfield_file.head()

In [None]:

#Now save your new file to a new path
openfield_file.to_csv('new_file_path')