# Cleaning Public Support data

## Importing necessary libraries

In [1]:
pip install pandas matplotlib seaborn wordcloud

Collecting pandas
  Using cached pandas-1.5.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting matplotlib
  Using cached matplotlib-3.6.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (9.4 MB)
Collecting seaborn
  Using cached seaborn-0.12.1-py3-none-any.whl (288 kB)
Collecting wordcloud
  Using cached wordcloud-1.8.2.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (458 kB)
Collecting numpy>=1.20.3
  Using cached numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting pytz>=2020.1
  Using cached pytz-2022.6-py2.py3-none-any.whl (498 kB)
Collecting kiwisolver>=1.0.1
  Using cached kiwisolver-1.4.4-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.2 MB)
Collecting fonttools>=4.22.0
  Using cached fonttools-4.38.0-py3-none-any.whl (965 kB)
Collecting pillow>=6.2.0
  Using cached Pillow-9.3.0-cp38-cp38-manylinux_2_28_x86_64.whl (3.3 MB)
Collecting cycler>=0.10
  Using cached cycler-0.11.0-py3-none-a

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, date, timedelta

## Loading the data

In [3]:
slack = pd.read_csv('../sources/support-channels.csv')


## Discover

In [4]:
print('Shape of slack dataframe before cleaning:', slack.shape)

Shape of slack dataframe before cleaning: (481, 14)


In [5]:
slack.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 481 entries, 0 to 480
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Channel_ID        481 non-null    object
 1   Channel_Slug      481 non-null    object
 2   Timestamp         481 non-null    object
 3   Timestamp_Thread  368 non-null    object
 4   User_ID           481 non-null    object
 5   Full_Name         470 non-null    object
 6   Email             481 non-null    object
 7   Permalink         481 non-null    object
 8   Text              481 non-null    object
 9   Text_raw          481 non-null    object
 10  Slack_username    481 non-null    object
 11  Team_ID           481 non-null    object
 12  Team_Name         481 non-null    object
 13  Is_Bot            481 non-null    bool  
dtypes: bool(1), object(13)
memory usage: 49.4+ KB


**Creating 2 new columns**

In [6]:
# A message with no thread timestamp is treated as a question (it is not a reply in a thread)
slack['Is_a_question'] = np.where(slack['Timestamp_Thread'].isnull(), 1, 0)

In [7]:
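# User IDs of the support agents; isin() matches exactly, so these must use the same format as User_ID
support_agents = ['1', '5301']

slack['Is_agent'] = np.where(slack['User_ID'].isin(support_agents), 1, 0)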
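
A minimal sketch of how the two flags behave, run on a toy frame rather than the real export. The IDs, timestamps, and the `toy_agents` list are hypothetical and only illustrate the logic.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical IDs and timestamps, not taken from the real export)
flags_demo = pd.DataFrame({
    'User_ID': ['U1', 'U2', 'U1'],
    'Timestamp_Thread': [None, '2022-11-04 17:02:51', '2022-11-04 17:02:51'],
})
toy_agents = ['U2']  # hypothetical agent list

# A missing thread timestamp is flagged as a question
flags_demo['Is_a_question'] = np.where(flags_demo['Timestamp_Thread'].isnull(), 1, 0)
# Exact membership test against the agent list
flags_demo['Is_agent'] = np.where(flags_demo['User_ID'].isin(toy_agents), 1, 0)

print(flags_demo)
```

Because `isin` compares values exactly, an agent list whose IDs use a different format from `User_ID` would flag nobody.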

**Converting timestamp columns**

In [8]:
slack['Datetime'] = pd.to_datetime(slack['Timestamp'])
slack['Datetime_Thread'] = pd.to_datetime(slack['Timestamp_Thread'])
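
A small sketch of the conversion on toy values, assuming the timestamps are stored as ISO-like strings (the raw format of the export is not shown here). Missing thread timestamps become `NaT` instead of raising an error.

```python
import pandas as pd

# Toy data (hypothetical values in an assumed ISO-like format)
ts_demo = pd.DataFrame({
    'Timestamp': ['2022-11-04 17:02:51', '2022-11-04 17:04:57'],
    'Timestamp_Thread': ['2022-11-04 17:02:51', None],
})

ts_demo['Datetime'] = pd.to_datetime(ts_demo['Timestamp'])
# The missing value is converted to NaT (pandas' "not a time" marker)
ts_demo['Datetime_Thread'] = pd.to_datetime(ts_demo['Timestamp_Thread'])

print(ts_demo.dtypes)
```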

**Creating 2 dataframes: questions and answers**

In [9]:
questions_df = slack[slack['Is_a_question'] == 1]
answers_df = slack[slack['Is_a_question'] == 0]

In [10]:
answers = answers_df.groupby(['Channel_ID','User_ID','Datetime'])[['Text']]

In [11]:
# Summing a text column concatenates the strings within each group
df3 = answers.sum().reset_index()
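
A sketch of what the grouped sum does to text, on hypothetical toy data. Messages that share the same channel, user, and timestamp are concatenated with no separator. Joining with a space is shown as one possible alternative, not as the notebook's method.

```python
import pandas as pd

# Toy data (hypothetical): two messages share the same channel, user, and timestamp
concat_demo = pd.DataFrame({
    'Channel_ID': ['C1', 'C1', 'C1'],
    'User_ID': ['U1', 'U1', 'U2'],
    'Datetime': pd.to_datetime(['2022-11-04 17:02:51'] * 3),
    'Text': ['hola', 'mundo', 'ok'],
})
keys = ['Channel_ID', 'User_ID', 'Datetime']

# sum() on strings concatenates them directly
summed = concat_demo.groupby(keys)[['Text']].sum().reset_index()
# ' '.join keeps a space between the merged messages
joined = concat_demo.groupby(keys)['Text'].agg(' '.join).reset_index()

print(summed)
print(joined)
```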

In [12]:
df3.head()

| | Channel_ID | User_ID | Datetime | Text |
|---|---|---|---|---|
| 0 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:02:51 | No se quiere usar un tercero para las fotos |
| 1 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:04:57 | digamos que son imagenes de usuarios en donde ... |
| 2 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:05:02 | comentarios y likes |
| 3 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:06:30 | y la base de datos no solo contiene las imagen... |
| 4 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:26:59 | https://isn365.com/ |


In [13]:
# Time elapsed since the same user's previous message (NaT for each user's first message)
df3['difference'] = df3.sort_values('Datetime').groupby('User_ID')['Datetime'].diff()

In [14]:
df3['difference'] = df3['difference'].fillna(pd.Timedelta(seconds=0))

In [16]:
# Convert the timedelta to a float number of seconds
df3['difference'] = df3['difference'].dt.total_seconds()

In [19]:
df3.rename(columns={'difference': 'diff_in_seconds'}, inplace=True)

In [20]:
df3.head(15)

| | Channel_ID | User_ID | Datetime | Text | diff_in_seconds |
|---|---|---|---|---|---|
| 0 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:02:51 | No se quiere usar un tercero para las fotos | 0.0 |
| 1 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:04:57 | digamos que son imagenes de usuarios en donde ... | 126.0 |
| 2 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:05:02 | comentarios y likes | 5.0 |
| 3 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:06:30 | y la base de datos no solo contiene las imagen... | 88.0 |
| 4 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:26:59 | https://isn365.com/ | 1229.0 |
| 5 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:27:20 | este es el sitio, se quiere hacer como un wall... | 21.0 |
| 6 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:34:48 | Es que lo que se quiere es tener las imagenes ... | 448.0 |
| 7 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:38:07 | Voy a explorar la posibilidad de guardar las i... | 199.0 |
| 8 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:43:00 | Por otro lado que es mas barato, el host o la BD | 293.0 |
| 9 | CAZ9W99U4 | U01KGAER1TM | 2022-11-04 17:43:39 | creo que por ese lado es una buena opcion porq... | 39.0 |
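
The per-user gap computation can be reproduced end to end on a toy frame. This is a sketch with hypothetical users `A` and `B`. It chains the same steps (sort, per-user `diff`, fill the first message with zero, convert to seconds) into one expression.

```python
import pandas as pd

# Toy data (hypothetical users; rows deliberately interleaved)
gaps_demo = pd.DataFrame({
    'User_ID': ['A', 'B', 'A', 'A'],
    'Datetime': pd.to_datetime([
        '2022-11-04 17:02:51', '2022-11-04 17:03:00',
        '2022-11-04 17:04:57', '2022-11-04 17:05:02',
    ]),
})

gaps_demo['diff_in_seconds'] = (
    gaps_demo.sort_values('Datetime')
             .groupby('User_ID')['Datetime']
             .diff()                             # gap to the same user's previous message
             .fillna(pd.Timedelta(seconds=0))    # first message per user gets a zero gap
             .dt.total_seconds()                 # timedelta -> float seconds
)

print(gaps_demo)
```

The result is assigned back by index label, so the sort inside the expression does not reorder the original rows.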


In [17]:
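# Sketch of the apparent intent: merge consecutive messages from the same user
# that were sent less than 300 seconds apart
ordered = df3.sort_values(['User_ID', 'Datetime'])
# A new burst starts when the user changes or the gap reaches 300 seconds
new_burst = (ordered['User_ID'] != ordered['User_ID'].shift()) | (ordered['diff_in_seconds'] >= 300)
merged = (
    ordered.groupby(new_burst.cumsum())
           .agg(Channel_ID=('Channel_ID', 'first'), User_ID=('User_ID', 'first'),
                Datetime=('Datetime', 'first'), Text=('Text', ' '.join))
           .reset_index(drop=True)
)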

**Number of interactions per student**
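
One way this count could be computed, shown as a sketch on hypothetical toy data. It assumes a "student" is identified by `User_ID` and that every row counts as one interaction. Whether agents or bots should be excluded first is not stated in the notebook.

```python
import pandas as pd

# Toy data (hypothetical users)
interactions_demo = pd.DataFrame({
    'User_ID': ['U1', 'U2', 'U1', 'U3', 'U1'],
})

# One row per message, so the group size is the number of interactions
interactions_per_student = (
    interactions_demo.groupby('User_ID')
                     .size()
                     .sort_values(ascending=False)
)

print(interactions_per_student)
```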

**Number of questions per student**
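
A possible sketch for this count, again on hypothetical toy data. It assumes the `Is_a_question` flag (1 for a question, 0 otherwise) and sums it per `User_ID`, so users who only replied show up with a count of 0.

```python
import pandas as pd

# Toy data (hypothetical users and flags)
questions_demo = pd.DataFrame({
    'User_ID': ['U1', 'U2', 'U1', 'U3', 'U1'],
    'Is_a_question': [1, 0, 0, 1, 1],
})

# Summing a 0/1 flag counts the rows where it is 1
questions_per_student = questions_demo.groupby('User_ID')['Is_a_question'].sum()

print(questions_per_student)
```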

## Data Cleaning

**Encoding boolean column**
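
`Is_Bot` is the only `bool` column reported by `slack.info()`, so it is presumably the column meant here. A minimal sketch of a 0/1 encoding on toy values, assuming an integer encoding is what is wanted:

```python
import pandas as pd

# Toy data (hypothetical values)
bot_demo = pd.DataFrame({'Is_Bot': [True, False, False]})

# True -> 1, False -> 0
bot_demo['Is_Bot'] = bot_demo['Is_Bot'].astype(int)

print(bot_demo)
```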

**Cleaning joined dataframe**